Introduction to SQL

Introduction to SQL

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is SQL?

  • Why is SQL significant?

  • What is the relationship between a relational database and SQL?

Objectives
  • Define a relational database.

  • Explain what SQL is and why to use it.

What is SQL?

Structured Query Language, or SQL (sometimes pronounced “sequel”), is a powerful language used to interrogate and manipulate relational databases. It is not a general programming language that you can use to write an entire program. However, SQL queries can be called from programming languages to let any program interact with databases. There are several variants of SQL, but all support the same basic statements that we will be covering today.

Relational databases

Relational databases consist of one or more tables of data. These tables have fields (columns) and records (rows). Every field has a data type. Every value in the same field of each record has the same type. These tables can be linked to each other when a field in one table can be matched to a field in another table. SQL queries are the commands that let you look up data in a database or make calculations based on columns.

Entity relation diagram

Data types help quality control of entries - you will receive an error if you try to enter a word into a field that should contain a number. Understanding the nature of relational databases, and using SQL, will help you in using databases in programming languages such as R or Python.

Why use SQL?

SQL is well established and has been around since the 1970s. It is still widely used in a variety of settings.

SQL lets you keep the data separate from the analysis. There is no risk of accidentally changing data when you are analysing it. If the data is updated, a saved query can be re-run to analyse the new data.

SQL is optimised for handling large amounts of data.

Many web applications (including WordPress and ecommerce sites like Amazon) run on a SQL (relational) database. Understanding SQL is the first step in eventually building custom web applications that can serve data to users.

Database Management Systems

A database management system is a software tool that enables users to manage a database easily. It allows users to access and interact with the underlying data in the database. These actions can range from simply querying data to defining database schemas that fundamentally affect the database structure.

There are a number of different database management systems for working with relational data. We’re going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g., MySQL, PostgreSQL, MS Access, Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and possibly some differences in datatype.

Introduction to DB Browser for SQLite

Let’s all open the database we downloaded via the setup in DB Browser for SQLite.

You can see the tables in the database by looking at the left hand side of the screen under Tables.

To see the contents of a table, click on that table and then click on the Browse Data tab above the table data.

If we want to write a query, we click on the Execute SQL tab.

Dataset Description

The data we will be using consists 4 tables of movie or series titles.

filmsAndSeries

productionCountries

genres

countries

SQL Data Type Quick Reference

The main data types that are used in imdb database are INTEGER and TEXT which define what value the table column can hold.

Different database software/platforms have different names and sometimes different definitions of data types, so you’ll need to understand the data types for any platform you are using. The following table explains some of the common data types and how they are represented in SQLite; more details available on the SQLite website.

Data type Details Name in SQLite
boolean or binary this variable type is often used to represent variables that can only have two values: yes or no, true or false. doesn’t exist - need to use integer data type and values of 0 or 1.
integer sometimes called whole numbers or counting numbers. Can be 1,2,3, etc., as well as 0 and negative whole numbers: -1,-2,-3, etc. INTEGER
float, real, or double a decimal number or a floating point value. The largest possible size of the number may be specified. REAL
text or string and combination of numbers, letters, symbols. Platforms may have different data types: one for variables with a set number of characters - e.g., a zip code or postal code, and one for variables with an open number of characters, e.g., an address or description variable. TEXT
date or datetime depending on the platform, may represent the date and time or the number of days since a specified date. This field often has a specified format, e.g., YYYY-MM-DD doesn’t exist - need to use built-in date and time functions and store dates in real, integer, or text formats. See Section 2.2 of SQLite documentation for more details.
blob a Binary Large OBject can store a large amount of data, documents, audio or video files. BLOB

Key Points

  • SQL is a powerful language used to interrogate and manipulate relational databases.


Selecting and sorting data

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is a query?

  • How do you query databases using SQL?

  • How do you retrieve unique values in SQL?

  • How do you sort results in SQL?

Objectives
  • Understand how SQL can be used to query databases

  • Understand how to build queries, using SQL keywords such as DISTINCT and ORDER BY

What is a query?

A query is a question or request for data. For example, “How many journals does our library subscribe to?” When we query a database, we can ask the same question using a common language called Structured Query Language or SQL in what is called a statement. Some of the most useful queries - the ones we are introducing in this first section - are used to return results from a table that match specific criteria.

Writing my first query

Let’s start by opening DB Browser for SQLite and the imdb database (see Setup). Choose Browse Data and the filmsAndSeries table. The filmsAndSeries table contains columns or fields such as title, type, description, release_year, etc.

Let’s write a SQL query that selects only the title column from the filmsAndSeries table.

SELECT title
FROM filmsAndSeries;

Note the order is important. First SELECT, then FROM.

If we want more information, we can add a new column to the list of fields right after SELECT:

SELECT Title, release_year
FROM filmsAndSeries;

Or we can select all of the columns in a table using the wildcard *.

SELECT *
FROM filmsAndSeries;

Capitalization and good style

In the first query above, we have capitalized the words SELECT and FROM because they are SQL keywords. Even though capitalization makes no difference to the SQL interpreter, capitalization of these SQL terms helps for readability and is therefore considered good style. As you write and expand your own queries, it might be helpful to pick an option, such as CamelCase, and use that style when naming tables and columns. Some tables and columns require capitalization and some do not. An occasional change of capitalization for these table and column names may be needed.

Example:

SELECT title
FROM filmsAndSeries;

instead of

SELECT Title
FROM filmsandseries;

Comments

When the queries become more complex, it is good style to add comments to explain to yourself, or to others, what you are doing with your query. Comments help explain the logic of a section and provide context for anyone reading the query. It’s essentially a way of making notes within your SQL. In SQL, comments begin using – and end at the end of the line. To mark a whole paragraph as a comment, you can enclose it with the characters /* and */. For example, a commented version of the above query can be written as:

-- We are only interested in the title
SELECT title
-- which we can find in the table filmsAndSeries
FROM filmsAndSeries

Unique values

There may be a situation when you need to retrieve unique records and not multiple duplicate records. The SQL DISTINCT keyword is used after SELECT to eliminate duplicate records and fetch only unique records. Let’s return all of the unique titles in a SQL query.

SELECT DISTINCT title
FROM filmsAndDatabases;

Note, some database systems require a semicolon ; after each SQL statement. If we select more than one column, then the distinct pairs of values are returned.

SELECT DISTINCT title, release_year
FROM filmsAndSeries;

Sorting

We can also sort the results of our queries by using the keyword ORDER BY. Let’s create a query that sorts the articles table in ascending order by release_year using the ASC keyword in conjunction with ORDER BY.

SELECT *
FROM filmsAndSeries
ORDER BY release_year ASC;

The keyword ASC tells us to order it in ascending order. Instead, we can use DESC to get the descending order sorting by tmdb_score.

SELECT *
FROM filmsAndSeries
ORDER BY tmdb_score DESC;

ASC is the default, so by omitting ASC or DESC, SQLite will sort ascending (ASC).

We can also sort on several fields at once, in different directions. For example, we can order by tmdb_score descending and then release_year ascending in the same query.

SELECT *
FROM filmsAndSeries
ORDER BY tmdb_score DESC, release_year ASC;

Challenge

Write a query that returns Title, imdb_score, release_year and imdb_votes from the filmsAndSeries table, ordered by the number of votes with the highest number first and alphabetically by title.

Solution

SELECT title, imdb_score, release_year, imdb_votes
FROM filmsAndSeries
ORDER BY imdb_votes DESC, title ASC;

Key Points

  • SQL is ideal for querying databases

  • SQL queries have a basic query structure starting with SELECT field FROM table with additional keywords and criteria that can be used.


Filtering

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I filter data?

Objectives
  • Write queries that SELECT data based on conditions, such as AND, OR, and NOT.

  • Understand how to use the WHERE clause in a statement.

  • Learn how to use comparison keywords such as LIKE in a statement.

Filtering

SQL is a powerful tool for filtering data in databases based on a set of conditions. Let’s say we only want data for a specific year. To filter by a year, we will use the WHERE clause.

SELECT *
FROM filmsAndSeries
WHERE release_year = 2020;

We can add additional conditions by using AND, OR, and/or NOT. For example, suppose we want the films with more than 500000 votes:

SELECT *
FROM filmsAndSeries
WHERE type="MOVIE" AND imdb_votes > 500000;

If we want to get data for movies from two years, we can combine the tests using OR:

SELECT *
FROM filmsAndSeries
WHERE type = "MOVIE" AND (release_year = 2020 OR release_year = 2022);

Or we could use IN:

SELECT *
FROM filmsAndSeries
WHERE type = "MOVIE" AND release_year IN (2020 ,2022);

When you do not know the entire value you are searching for, you can use comparison keywords such as LIKE, IN, BETWEEN...AND, IS NULL. For instance, we can use LIKE in combination with WHERE to search for data that matches a pattern.

For example, using the filmsAndSeries table again, let’s SELECT all of the data WHERE the title contains “Monty Python”:

SELECT *
FROM filmsAndSeries
WHERE title LIKE '%monty python%';

You may have noticed the wildcard character %. It is used to match zero to many characters. So in the SQL statement above, it will match zero or more characters before and after ‘Monty Python’.

Let’s see what variations of the term we got. Notice uppercase and lowercase, the addition of ‘s’ at the end of structures, etc.

To learn more about other comparison keywords you can use, see Beginner SQL Tutorial on SQL Comparison Keywords.

Challenge

Write a query that returns the title, description, release_year for all movies and series that have “monty python” in the title or description.

Solution

SELECT title, description, release_year
FROM filmsAndSeries
WHERE title LIKE "%Monty Python%" OR description like "%Monty Python%" ;

You can continue to add or chain conditions together and write more advanced queries.

Key Points

  • Use WHERE to filter and retrieve data based on specific conditions.

  • Use AND, OR, and NOT to add additional conditions.

  • Use the comparison keyword LIKE and wildcard characters such as % to match patterns.


Aggregating & calculating values

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How can we aggregate values in SQL?

  • Can SQL be used to make calculations?

Objectives
  • Use SQL functions like AVG in combination with clauses like Group By to aggregate values.

  • Make calculations on fields using SQL.

Functions

SQL contains functions which allow you to make calculations on data in your database. Some of the most common functions are MAX, MIN, AVG, COUNT, SUM, and they will: MAX (find the maximum value in a field), MIN (find the minimum value in a field), AVG (find the average value of a field), COUNT (count the number of values in a field and present the total), and SUM (add up the values in a field and present the sum).

For instance, we can find the number of items in a table with this query:

SELECT COUNT(*)
FROM filmsAndSeries

Aggregation

Let’s say we wanted to get the average tmdb_score and imdb_score for each release_year. We can use AVG and the GROUP BY clause in a query:

SELECT release_year, AVG(imdb_score), AVG(tmdb_score)
FROM filmsAndSeries
GROUP BY release_year;

GROUP BY is used by SQL to aggregate identical data into groups. In this case, we are aggregating all the scores by years. Then we use AVG to get the averages of imdb_score and tmdb_score. This process of aggregation allows us to combine results by grouping records based on value and calculating combined values in groups.

As you can see, it is difficult to tell which year has the highest average score and the least. We can improve upon the query above by using ORDER BY and DESC.

SELECT release_year, AVG(imdb_score), AVG(tmdb_score)
FROM filmsAndSeries
GROUP BY release_year 
ORDER BY AVG(imdb_score) DESC;

Challenge

Write a query using an aggregate function that returns the number of movies per year, sorted by title count in descending order. Which year has the most titles? (Hint to choosing which aggregate function to use - it is one of the common aggregate functions MAX, MIN, AVG, COUNT, SUM.)

Solution

SELECT release_year, COUNT(title)
FROM filmsAndSeries
WHERE type = 'MOVIE'
GROUP BY release_year
ORDER BY COUNT(Title) DESC;

The HAVING keyword

SQL offers a mechanism to filter the results based on aggregate functions, through the HAVING keyword.

For example, we can adapt the last request we wrote to only return information about release_year with 100 or more movies:

SELECT release_year, COUNT(*)
FROM filmsAndSeries
WHERE type = 'MOVIE'
GROUP BY release_year
HAVING COUNT(title) >= 100
ORDER BY COUNT(title) DESC;

The HAVING keyword works exactly like the WHERE keyword, but uses aggregate functions instead of database fields. When you want to filter based on an aggregation like MAX, MIN, AVG, COUNT, SUM, use HAVING. To filter based on the individual values in a database field, use WHERE.

Note that HAVING comes after GROUP BY. One way to think about this is: the data are retrieved (SELECT), can be filtered (WHERE), then joined in groups (GROUP BY), finally we only select some of these groups (HAVING).

Challenge

Write a query that returns year and average imdb_score from the filmsAndSeries table, the average imdb_score for each release year but only for the years with an average score større end eller lig med 5.

Solution

SELECT release_year, AVG(imdb_score)
FROM filsAndSeries
GROUP BY release_year
HAVING AVG(imdb_score)>=5;

Calculations

In SQL, we can also perform calculations as we query the database. Also known as computed columns, we can use expressions on a column or multiple columns to get new values during our query. For example, what if we wanted to calculate a new column called score_difference:

SELECT title, release_year, imdb_score - tmdb_score AS score_difference
FROM filmsAndSeries
ORDER BY imdb_score - tmdb_score DESC;

In section 6. Joins and aliases we are going to learn more about the SQL keyword AS and how to make use of aliases - in this example we simply used the calculation and AS to represent that the new column is different from the original SQL table data.

We can use any arithmetic operators (like +, -, *, /, square root SQRT or the modulo operator %) if we would like.

Key Points

  • Queries can do arithmetic operations on field values.


Joins and aliases

Overview

Teaching: 25 min
Exercises: 20 min
Questions
  • How do I join two tables if they share a common point of information?

  • How can I use aliases to improve my queries?

Objectives
  • Understand how to link tables together via joins.

  • Understand when it is valuable to use aliases or shorthand.

Joins

The SQL JOIN clause allows us to combine columns from one or more tables in a database by using values common to each. It follows the FROM clause in a SQL statement. We also need to tell the computer which columns provide the link between the two tables using the word ON.

Let’s start by joining data from the filmAndSeries table with the genres table. The id columns in both these tables links them.

SELECT *
FROM filmsAndSeries
JOIN genres
ON filmsAndSeries.id  = genres.id;

ON is similar to WHERE, it filters things out according to a test condition. We use the table.colname format to tell the SQL manager what column in which table we are referring to.

Alternatively, we can use the word USING, as a short-hand. In this case we are telling DB Browser that we want to combine filmsAndSeries with genres and that the common column is id.

SELECT *
FROM filmsAndSeries
JOIN genres
USING (id);

When joining tables, you can specify the columns you want by using table.colname instead of selecting all the columns using *. For example:

SELECT filmsAndSeries.title, genres.genre, filmsAndSeries.release_year
FROM filmsAndSeries
JOIN genres
ON filmsAndSeries.id = genres.id;

Joins can be combined with sorting, filtering, and aggregation. So, if we wanted the average scores for movies in each genre, we can use the following query:

SELECT genres.genre,  ROUND(AVG(filmsAndSeries.imdb_score), 2), ROUND(AVG(filmsAndSeries.tmdb_score), 2)
FROM genres
JOIN filmsAndSeries
USING (id)
WHERE type = "MOVIE"
GROUP BY genres.genre;

The ROUND function allows us to round the score number returned by the AVG function by 2 decimal places.

Challenge

Write a query that JOINS the genres and filmsAndSeries tables and that returns the genre, total number of series and average number seasons for every genre.

Solution

SELECT genres.genre, count(*), avg(filmsAndSeries.seasons)
FROM filmsAndSeries
JOIN genres
USING (id)
GROUP BY genres.genre;

You can also join multiple tables. For example:

SELECT filmsAndSeries.title, productionCountries.country, genres.genre
FROM filmsAndSeries
JOIN genres
ON filmsAndSeries.id = genres.id
JOIN productionCountries
ON productionCountries.id = genres.id;

Challenge:

Write a query that returns the title, number of production countries, number of genres, ordered by number of production countries in descending order.

Solution

SELECT filmsAndSeries.title, COUNT(DISTINCT productionCountries.Country), COUNT(DISTINCT genres.genre)
FROM filmsAndSeries
JOIN productionCountries
USING (id)
JOIN genres
USING (id)
GROUP BY title
ORDER BY COUNT(DISTINCT Country) DESC;

There are different types of joins which you can learn more about at SQL Joins Explained.

Aliases

As queries get more complex, names can get long and unwieldy. To help make things clearer we can use aliases to assign new names to items in the query.

We can alias both table names:

SELECT fas.title, g.genre
FROM filmsAndSeries AS fas
JOIN genres  AS g
ON fas.id = g.id;

And column names:

SELECT fas.title AS title, pc.Country AS country
FROM filmsAndSeries AS fas
JOIN productionCountries  AS pc
ON fas.id = pc.id;
WHERE country = "Denmark"

The AS isn’t technically required, so you could do:

SELECT fas.Title t
FROM filmsAndSeries fas;

But using AS is much clearer so it is good style to include it.

Key Points

  • Joining two tables in SQL is an good way to analyse datasets, especially when both datasets provide partial answers to questions you want to ask.

  • Creating aliases allows us to spend less time typing, and more time querying!


Saving queries

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I save a query for future use?

  • How can I remove a saved query?

Objectives
  • Learn how to save repeated queries as ‘Views’ and how to drop them.

Saving queries for future use

It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this: views. Views are queries saved in the database. You query it as a (virtual) table that is populated every time you query it.

Creating a view from a query requires you to add CREATE VIEW viewname AS before the query itself. For example, if we want to save the query giving the number of productions from each country in a view, we can write:

CREATE VIEW production_counts AS
SELECT countries.country, COUNT(*)
FROM productionCountries
JOIN countries
ON countries.code = productionCountries.Country
GROUP BY countries.Country;

Now, we will be able to access these results with a much shorter notation:

SELECT *
FROM production_counts;

Assuming we do not need this view anymore, we can remove it from the database.

DROP VIEW production_counts;

In DBBrowser for SQLite, you can also create a view from any query by omitting the CREATE VIEW viewname AS statement and instead, clicking the small Save icon at the top of the Execute SQL tab and then clicking Save as view. Whatever method you use to create a view, it will appear in the list of views under the Database Structure tab.

Challenge

Write a CREATE VIEW query that JOINS the filmsAndSeries table with the productionCountries table on id and returns the AVG of imdb_score grouped by the Country in DESC order.

Solution

CREATE VIEW scores_by_country AS
SELECT countries.country, AVG(filmsAndSeries.imdb_score) as score
FROM productionCountries
JOIN filmsAndSeries
USING (id)
JOIN countries
ON countries.code = productionCountries.Country
GROUP BY countries.country
ORDER BY score DESC

Key Points

  • Saving queries as ‘Views’ allows you to save time and avoid repeating the same operation more than once.


Database design

Overview

Teaching: 25 min
Exercises: 20 min
Questions
  • What is database design?

Objectives
  • Use Entity Relationship Diagrams to visualise and structure your data.

Spreadsheets

Spreadsheets are often created to keep lists of a variety of things like an inventory of equipment or statistics. Spreadsheets are an easy way to display data organized in columns and rows. Column headers describe the data contained in corresponding columns. Each row is a record (sometimes called an observation) with data about it contained in separate column cells.

Spreadsheets can make data gathering easier but they can also lead to messy data. Over time, if you gather enough data in spreadsheets, you will likely end up with inconsistent data.

Designing a relational database for your data can help reduce the places where these errors can be introduced. You can also use SQL queries to find these issues and address them across your entire dataset. Before you can take advantage of all of these tools, you need to design your database.

Database Design

Database design involves a model or plan developed to determine how the data is stored, organized and manipulated. The design addresses what data will be stored, how they will be classified, and the interrelationships between the data across different tables in the database.

Terminology

In the Introduction to SQL lesson, we introduced the terms “fields”, “records”, and “values”. These terms are commonly used in databases while the “columns”, “rows”, and “cells” terms are more common in spreadsheets. Fields store a single kind of information (text, integers, etc.) related to one topic (title, year, etc.), while records are a set of fields containing specific values related to one item in your database (a movie, a country, etc.).

To design a database, we must first decide what kinds of things we want to represent as tables. A table is the physical manifestation of a kind of “entity”. An entity is the conceptual representation of the thing we want to store informtation about in the database, with each row containing information about one entity. An entity has “attributes” that describe it, represented as fields. For example, a movie or a series is an entity. Attributes would be things like the title, or scores which would appear as fields.

To create relationships between tables later on, it is important to designate one column as a primary key. A primary key, often designated as PK, is one attribute of an entity that distinguishes it from the other entities (or records) in your table. The primary key must be unique for each row for this to work. A common way to create a primary key in a table is to make an ‘id’ field that contains an auto-generated integer that increases by 1 for each new record. This will ensure that your primary key is unique.

It is useful to describe on an abstract level the entities we would like to capture, along with how the different entities are related to each other. We do this using and entity relationship diagram (ER diagram or ERD).

Entity Relationship Diagram (ER Diagram or ERD)

ERDs are helpful tools for visualising and structuring your data more efficiently. They allow you to map relationships between concepts and ultimately construct a relational database. The following is an ERD of the database used in this lesson:

Relationships between entities and their attributes are represented by lines linking them together. For example, the line linking filmsAndSeries and genres is interpreted as follows: The ‘filmsAndSeries’ entity is related to the ‘genres’ entity through the attributes ‘id’ in both tables.

Conceptually, we know that a movie or series can have several genres and each genre will have many series and movies. This is known as a many-to-many relationship.

More Terminology

The degree of relationship between entities is known as their ‘cardinality’. Using the filmsAndSeries - genres example, the ‘filmsAndSeries’ table contains a primary key (PK) called ‘id’. When the PK is used to create a connection between the original table and a different table, it is called a foreign key (FK) in the other table. To follow the example, we see a field in the ‘genres’ table called id that contains the values from the ‘id’ field in the ‘filmsAndSeries’ table, connected the two tables.

There are 4 main types of relationships between tables:

In our example of the ‘id’ field in the ‘genres’ or ‘productionCountries’ tables has many to many relationship with the filmsAndSeries table. The ‘code’ in the ‘countries’ table has a one to one relationship with the ‘country’ field in the ‘productionCountries’ field.

A key attribute is often included when designing databases to facilitate joins.

Normalisation

ERDs are helpful in normalising your data which is a process that can be used to create tables and establish relationships between those tables with the goal of eliminating redundancy and inconsistencies in the data.

In the example ERD above, creating a separate table for countries and linking to it from the productionCountries table via PK and FK identifiers allows us to normalise the data and avoid inconsistencies. If we used one table, we could introduce country name errors such as misspellings or alternate names.

There are a number of normal forms in the normalisation process that can help you reduce redundancy in database tables. Study Tonight features tutorials where you can learn more about them.

Identifying remaining inconsistencies in the ERD

Are there other tables and relationships you can create to further normalise the data and avoid inconsistencies?

Answers

  1. Seperate tables for movies and series would solve the issue that the seasons field is only used for series.
  2. A parentalGuidance table would solve the issue, that only some titles has an associated age_certification. Can you spot anything else?

Additional database design tutorials to consult from Lucidchart:

Key Points

  • Database design is helpful for creating more efficient databases.


Creating tables and modifying data

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can I create, modify, and delete tables and data?

Objectives
  • Write statements that create tables.

  • Write statements to insert, modify, and delete records.

So far we have only looked at how to get information out of a database, both because that is more frequent than adding information, and because most other operations only make sense once queries are understood. If we want to create and modify data, we need to know two other sets of commands.

The first pair are CREATE TABLE and DROP TABLE. While they are written as two words, they are actually single commands. The first one creates a new table; its arguments are the names and types of the table’s columns. For example, the following statement creates the table filmsAndSeries:


CREATE TABLE "filmsAndSeries" (	"id" TEXT, "title" TEXT, "type" TEXT, "description" TEXT, "release_year" INTEGER, "age_certification" TEXT, "runtime" INTEGER, "seasons" INTEGER, "imdb_score" REAL, "imdb_votes" INTEGER, "tmdb_popularity" REAL, "tmdb_score" REAL)

We can get rid of one of our tables using:

DROP TABLE filmsAndSeries;

Be very careful when doing this: if you drop the wrong table, hope that the person maintaining the database has a backup, but it’s better not to have to rely on it.

We talked about data types earlier in Introduction to SQL: SQL Data Type Quick Reference.

When we create a table, we can specify several kinds of constraints on its columns. For example, a better definition for the filmsAndSeries table would be:

CREATE TABLE "filmsAndSeries" (
	"id"	TEXT NOT NULL PRIMARY KEY,
	"title"	TEXT,
	"type"	TEXT,
	"description"	TEXT,
	"release_year"	INTEGER,
	"age_certification"	TEXT,
	"runtime"	INTEGER,
	"seasons"	INTEGER,
	"imdb_score"	REAL,
	"imdb_votes"	INTEGER,
	"tmdb_popularity"	REAL,
	"tmdb_score"	REAL,
)

Once again, exactly what constraints are available and what they’re called depends on which database manager we are using.

Once tables have been created, we can add, change, and remove records using our other set of commands, INSERT, UPDATE, and DELETE.

Here is an example of inserting rows into the filmsAndSeries table:


INSERT INTO "filmsAndSeries" VALUES ("tm84618",  "Taxi Driver",  "MVIE",	"A mentally unstable Vietnam War...",	        1976,	"R",	    114,NULL,	    8.2,	808582,     40.965,	8.179)
INSERT INTO "filmsAndSeries" VALUES ("tm154986",  "Deliverance",  "MOVIE",	"Intent on seeing the Cahulawassee...",	      1972,	"R",	      109,NULL,	    7.7,	107673,     10.01,	7.3)
INSERT INTO "filmsAndSeries" VALUES ("ts22164",  "Monty Python's Flying Circus",  "SHOW",	"A British sketch comedy...",	  1969,	"TV-14",	30,	4.0,	8.8,	73424,      17.617, 8.306)
INSERT INTO "filmsAndSeries" VALUES ("tm120801",  "The Dirty Dozen",	"MOVIE",	"12 American military prisoners in ...",	      1967,NULL , 	      150,NULL,		  7.7,	72662,	    20.398,	7.6)



We can also insert values into one table directly from another:

CREATE TABLE "myMovies"(title TEXT, description TEXT, year INTEGER );
INSERT INTO "myMovies" SELECT title, description, release_year FROM filmsAndSeries;

Modifying existing records is done using the UPDATE statement. To do this we tell the database which table we want to update, what we want to change, the values for any or all of the fields, and under what conditions we should update the values.

For example, we made a typo when entering the type of the first INSERT statement above, we can correct it with an update:

UPDATE filmsAndSeries SET type = "MOVIE"  WHERE id = "tm84618";

Be careful, do not forget the WHERE clause or the update statement will modify all of the records in the database.

Deleting records can be a bit trickier, because we have to ensure that the database remains internally consistent. If all we care about is a single table, we can use the DELETE command with a WHERE clause that matches the records we want to discard. We can remove the movie Deliverance from the filmsAndSeries table like this:

DELETE FROM filmsAndSeries WHERE title = 'Deliverance';

Exercise

Write an SQL statement to add the country “Gibraltar” (code: GI) to the table countries.

Solution

INSERT INTO "countries" VALUES ("GI", "Gibraltar");

Backing Up with SQL

SQLite has several administrative commands that aren’t part of the SQL standard. One of them is .dump, which prints the SQL commands needed to re-create the database. Another is .read, which reads a file created by .dump and restores the database.

Key Points

  • Use CREATE and DROP to create and delete tables.

  • Use INSERT to add data.

  • Use UPDATE to modify existing data.

  • Use DELETE to remove data.

  • It is simpler and safer to modify data when every record has a unique primary key.

  • Do not create dangling references by deleting records that other records refer to.


Other database tools

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Are there other database tools that I can use besides DB Browser and SqliteOnline?

Objectives
  • Point to additional resources on other database tools.

Other database tools

For this lesson, DB Browser, a free open source database tool was used, but there are other tools available, both proprietary and open. A helpful comparison of database tools is available via Wikipedia.

SqliteOnline is the free online database tool offered as an alternative to DB Browser for this lesson. In addition to SqliteOnline, there is data.world which also allows you to work online with SQL and (library) datasets and includes a tutorial. SQL use in Google Sheets, a popular spreadsheet application, is even a possibility, demonstrated via Ben Collin’s tutorial. There is also a business standard software package called, MySQL, which has a community edition that can be downloaded free.

Key Points

  • There are additional database tools beyond DB Browser and SqliteOnline.


Extra challenges (optional)

Overview

Teaching: 0 min
Exercises: 35 min
Questions
  • Are there extra challenges to practice translating plain English queries to SQL queries?

Objectives
  • Extra challenges to practice creating SQL queries.

Extra challenges (optional)

SQL queries help us ask specific questions which we want to answer about our data. The real skill with SQL is to know how to translate our questions into a sensible SQL queries (and subsequently visualise and interpret our results).

Have a look at the following questions; these questions are written in plain English. Can you translate them to SQL queries and give a suitable answer?

Also, if you would like to learn more SQL concepts and try additional challenges, see the Software Carpentry Databases and SQL lesson.

Challenge 1

How many series are there for each genre?

Solution 1

SELECT genre, COUNT( * )
FROM genres
JOIN filmsAndSeries
USING (id)
WHERE type = "SHOW"
GROUP BY genre;

Challenge 2

How many shows ran for only one season? How many ran for 2 seasons? How many 3? etc?

Solution 2

SELECT seasons, COUNT( * )
FROM filmsAndSeries
WHERE type = "SHOW"
GROUP BY seasons

Challenge 3

How many movies or shows are there from each country? Can you make an alias for the number of movies? Can you order the results by number of movies?

Solution 3

SELECT countries.country, COUNT( * ) AS n_movies
FROM productionCountries
JOIN countries
ON countries.code = productionCountries.country
GROUP BY productionCountries.country
ORDER BY n_movies DESC;

Challenge 4

How many titles are there for each age_certification type, and what is the average imdb_score for that age_certification type?

Solution 4

SELECT age_certification, AVG( imdb_score ), COUNT( * )
FROM filmsAndSeries
GROUP BY Licence;

Key Points

  • It takes time and practice to learn how to translate plain English queries into SQL queries.


Good Style

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is good SQL Style, and how can I abide by SQL conventions?

Objectives
  • Understand the foundation for making clean, readable SQL queries.

An Introduction to good style

There are many ways to write an SQL queries, but some look better than others. Abiding by good style guidelines will make your SQL queries easier to read, especially if you are sharing them with others. These are some quick tips for making your SQL look clean.

Pick column names that are precise and short

When choosing column names, it’s important to remember that a large part of what you type in your query will be composed of your column names. Choosing a column name that is one or two words (without any spaces!) will ensure that your queries are easier to type and read. If you include spaces in your column names, you will get an error message when you try to run your queries, so we would recommend using CamelCase, or An_Underscore.

Capitalization (sometimes) matters

In section two, we talked about SQL keywords/commands being case-insensitive (“We have capitalised the words SELECT and FROM because they are SQL keywords. This makes no difference to the SQL interpreter as it is case-insensitive, but it helps for readability and is therefore considered good style.”). But did you know that in some SQL programs, depending on the settings, table and column names are case sensitive? If your query isn’t working, check the capitalization.

Readability

As you may have noticed, we are able to write our query on one line, or on many. The general consensus with SQL is that if you can break it into components on multiple lines, it becomes easier to read. Using multiple lines and indenting, you can turn something that looks like this:

SELECT countries.Country C, filmsAndSeries.title FROM filmsAndSeries JOIN productionCountries USING (id) JOIN countries ON productionCountries.country = countries.code ORDER BY C

Into something that looks like this:

SELECT countries.Country C, filmsAndSeries.title
FROM filmsAndSeries
JOIN productionCountries
USING (id)
JOIN countries
ON productionCountries.country = countries.code
ORDER BY C

In some programs (such as MySQL), there will be tools that can automatically “beautify” your code for better readability.

Key Points

  • There are many ways to write an SQL queries, but some look better than others.