A resource for Heathmont College students studying VCE Applied Computing - Data Analytics Unit 3 & 4.
Understanding databases is crucial for managing and analyzing data. This section covers relational database concepts, SQL commands, and database management.
SELECT
, FROM
, WHERE
, JOIN
, ORDER BY
, GROUP BY
, INSERT
, UPDATE
, DELETE
, CREATE TABLE
, ALTER TABLE
.Key Knowledge:
Key Skills:
Study design key knowledge and key skills are taken verbatim from the VCE Applied Computing Study Design 2025-2028 © Victorian Curriculum and Assessment Authority. For current versions and related content visit www.vcaa.vic.edu.au
The information about design tools covers both spreadsheets and databases, so the information for the key knowledge point “design tools for representing databases and spreadsheets,” has been included in the page for Spreadsheets and Data Cleansing. Click this link to go there now..
While databases may be used in the Unit 4 Outcome 1 SAT task, the key knowledge and key skills do not directly mention database use.
A database is an organized collection of data, typically stored and accessed electronically from a computer system. Relational Databases are databases that model the data in tables with rows and columns. Each table represents an entity, and each row represents a record. The columns represent attributes or fields. In relational databases, tables are linked using foreign key to primary key relationships.
A primary key is a unique identifier for each record in a table. It must contain unique values and cannot be NULL. A foreign key is a field in a table that links to the primary key in another table. This establishes a relationship between those two tables.
For example, in an employees
table, the employee_id
could be the primary key. In a departments
table, the department_id
could be the primary key. If each employee belongs to a department, the department_id
in the employees
table would be a foreign key linking to the department_id
in the departments
table.
This means that we can retrieve all the information about each employees department (location, department name, phone number, manager, etc.) without needing to rewrite that information for each employee in the department. This is an important feature of relational databases.
SQL database supports a variety of data types, including:
INTEGER
: A whole numberDECIMAL
, FLOAT
or DOUBLE
: A decimal number. Different database systems use one or more of these names, sometimes with differing levels of precision.VARCHAR(n)
: A variable-length string with a maximum length of n
(sometimes just called TEXT
)DATE
: A date valueTIME
: A time valueBOOLEAN
: A true/false valueData types are especially important in SQL because they help ensure data integrity and accuracy. Databases will not allow data to be saved in a table unless the data type matches the column definition. This helps prevent errors and ensures that the data is consistent, but it also means that databases need to be carefully designed.
Note that the implementation of date/time data types can vary between database systems. For example, some databases store date and time together in a single data type, while others store them separately.
SQLite
, you can use TEXT
to store date and time values in the format YYYY-MM-DD HH:MM:SS
. SQLite then provides functions to manipulate these values as dates and times.DuckDB
, you can use DATE
and TIME
data types to store date and time values separately.PostgreSQL
, you can use TIMESTAMP
to store date and time values together.As well as defining the data type of a column, you can also define the size of the column. This is particularly relevant in text fields, where you can specify the maximum number of characters that can be stored in the column. For example, a VARCHAR(100)
column can store up to 100 characters. This ensures that data is stored efficiently and that the database is not storing more data than is necessary. It also helps to prevent errors where data is too long to be stored in a column.
SQL (Structured Query Language) is the standard language for relational database management systems. It is used to perform tasks such as querying data, updating data, and creating databases and tables. While most database systems have tiny differences in their implementation of SQL, it is almost entirely standardized across platforms.
SQL allows you to perform a wide range of operations on a database, including selecting data, aggregating data, updating data, and deleting data from database tables. The SQL commands we study in Data Analytics are:
SELECT
: Retrieve data from a databaseFROM
: Specify the table to retrieve data fromWHERE
: Filter data based on a conditionJOIN
: Combine rows from two or more tables based on a related column between themORDER BY
: Sort the result set in ascending or descending orderGROUP BY
: Group rows that have the same values into summary rowsINSERT
: Insert new records into a tableUPDATE
: Modify existing records in a tableDELETE
: Remove records from a tableSome additional commands that are useful to learn are:
CREATE TABLE
: Create a new table in the databaseALTER TABLE
: Modify an existing table in the databaseDROP TABLE
: Delete a table from the databaseCREATE SEQUENCE
: Create a sequence in the databaseSUM()
, COUNT()
, AVG()
, MIN()
, MAX()
: Aggregate functions for summarizing dataHAVING
: Filter data after grouping has been performedSELECT
statements are the most common SQL commands. The basic syntax for a SELECT
statement is:
SELECT column1, column2, ...
FROM table_name
WHERE condition;
This statement selects the specific columns (from the column list) from the table (specified by table_name
) where the condition is true. The WHERE
clause is optional, but it is used to filter the rows returned by the query.
SELECT
Statement Examplesgames
table:SELECT *
FROM games;
games
table:SELECT game_id, title, release_date
FROM games;
games
table where the release_date
is after 2020:SELECT *
FROM games
WHERE release_date > '2020-01-01';
Joins are used to combine rows from two or more tables based on a related column between them. There are different types of joins:
INNER JOIN
**`: Returns rows when there is a match in both tables.LEFT JOIN
(or LEFT OUTER JOIN
): Returns all rows from the left table and the matched rows from the right table.RIGHT JOIN
(or RIGHT OUTER JOIN
): Returns all rows from the right table and the matched rows from the left table.FULL JOIN
(or FULL OUTER JOIN
): Returns rows when there is a match in one of the tables.In Data Analytics we primarily focus on INNER JOIN
to combine rows from two tables where there is a match between the columns in both tables.
Consider two tables, employees
and departments
, with the following columns:
employees
table:
employee_id
first_name
last_name
department_id
departments
table:
department_id
department_name
location
To retrieve the first_name
, last_name
, and department_name
of each employee, you could use an INNER JOIN
:
SELECT employees.first_name, employees.last_name, departments.department_name
FROM employees
INNER JOIN departments
ON employees.department_id = departments.department_id;
SELECT * FROM students WHERE age = 17;
SELECT name, age FROM students ORDER BY age DESC;
INSERT INTO students (first_name, last_name, age, year_level) VALUES ('John', 'Doe', 17, 11);
UPDATE students SET year_level = 12 WHERE student_id = 131;
CREATE TABLE students (
student_id INTEGER PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
age INTEGER
);
Creating queries can be a complex endeavour, especially once you get past the level of single table queries with one condition. It becomes very important to thoroughly test your queries to ensure they are returning correct results.
WHERE
conditions with values at the boundaries.One way (and the primary way in Applied Computing) of recording your test results is through the use of a test table. Test tables have four columns:
Consider that Roger has been asked to retrieve the names of all students younger than 17. He writes the following query:
SELECT first_name, last_name
FROM students
WHERE age <= 17;
To test this query, Roger would create a test table like this:
Feature Tested | Test Data | Expected Results | Actual Results |
---|---|---|---|
Students younger than 17 | age = 17. Table has 2 17 year old students | No students should be returned | Returned two students whose age was exactly 17 |
Students younger than 17 | age = 16. Table has 3 16 year old students | All students aged 16 should be returned. | Returned three students aged 16 |
Students older than 17 | age = 18. | No students should be returned | No students were returned |
Students aged 0 | age = 0 An additional student with an age of 0 was added for testing | 1 student aged 0 should be returned | 1 student was returned |
Students aged 100 | age = 100 An additional student with an age of 100 was added for testing | No students should be returned | No students were returned |
The test table has helped Roger to identify that his query is not returning the correct results for the first test. He can now go back and modify his query to ensure it is returning the correct results.
Validation is a core part of any computing system. We validate data to ensure that it is reasonable and appropriate for the work at hand. Validation does not verify that the data is correct, but rather that it is reasonable.
Across Applied Computing we have three common types of validation that we perform:
Each of these forms of validation look a little different in a database.
To perform an existence check in the database, we are usually checking that data is present. In SQL this is done by using an IS NOT NULL
or IS NULL
statement. For example, to check that a student has a first name, you might use the following query:
SELECT *
FROM students
WHERE first_name IS NULL;
Any rows returned would indicate student records in our database without a first name, which indicates a likely error.
Existence checks can also be performed by the primary key to foreign key relationships. For example, if we have a students
table and a classes
table, the foreign key constraint will make sure we only add students to classes that exist in our classes
table.
Type checks are a little more complex in SQL. SQL databases are designed to store data in a specific format, and so the database will prevent you from storing data in the wrong format. One instance where you may be required to perform type checking would be in the case of an automatically import process. In that scenario we would:
Range checks are relatively simple in an SQL database, as performing queries that filter based on a condition is a core part of SQL. For example, to check that all students are between the ages of 4 and 18, you might use the following query:
SELECT *
FROM students
WHERE age < 4 OR age > 18;
This would return any student records with an age outside of the expected range.
Why do we need a primary key in relational database tables?
What is the purpose of a foreign key in a relational database?
Describe the difference between an INTEGER
data type and a DECIMAL
or FLOAT
data type in SQL. Give an example of when you might use each.
Hamsa is going to store the abbreviated state code (e.g., “VIC”, “NSW”, “WA”) in a database table. She is considering using a VARCHAR(3)
data type for this column. Is this a good choice? Why or why not?
Write an SQL query to select the first name, last name, and age of all students from a students
table where the age is greater than 15.
What is the difference between an INNER JOIN
and a LEFT JOIN
in SQL?
Describe the purpose of the GROUP BY
clause in SQL. Give an example of when you might use it.
What SQL command would you use to update someone’s email address in a users
table?
Delia is importing a large CSV file into a database table from a repository of weather data. The CSV has the following columns: date
, temperature
, humidity
, and wind_speed
. She wants to ensure that the data is stored correctly in the database. Describe how she could use each of the validation types on this data.
Explain the difference between the WHERE
and HAVING
clauses in SQL.
Michael has created a table called games
with the following columns: game_id
, title
, release_date
, and genre
. He has set the data type of title, release_date and genre to VARCHAR(400)
. After testing, he is confident that the database will function correctly. Why might Michael still want to reconsider the design of his database? Give two reasons.
Describe the results of the following SQL query:
SELECT department_name, COUNT(employee_id)
FROM departments
INNER JOIN employees
ON departments.department_id = employees.department_id
GROUP BY department_name;
Write a proposed test table (without the “Actual Results”) for the query used in Question 12. Include tests for departments with no employees, departments with multiple employees, and departments with only one employee.