Work with data using queries in Ballerina

This guide walks you through using query expressions (a.k.a. language integrated queries) on data to filter, sort, and join with different data sets to produce new data.

Ballerina has first-class support for writing SQL-like queries to process data. Language-integrated queries can process any Ballerina iterable.

Set up the prerequisites

To run this guide, you need the following prerequisites:

  1. Ballerina 2201.0.0 (Swan Lake) or greater
  2. A text editor

    Tip: Preferably, Visual Studio Code with the Ballerina extension installed.

  3. A command terminal

Create the Ballerina package

Ballerina uses packages to group code. You need to create a Ballerina package and write the business logic in it. In the terminal, execute the command below to create the Ballerina package for the implementation.

Info: For more information on Ballerina packages, see Organize Ballerina code.

$ bal new query_expressions

You view the output below.

Created new package 'query_expressions' at query_expressions.

This creates a directory named query_expressions with the default module along with a sample code for the service as shown below.

.
├── Ballerina.toml
└── main.bal
    
0 directories, 2 files

Create the dataset

To keep things simple, an in-memory table is used to store the COVID-19 dataset. To create the dataset, replace the main.bal file with the code below.

public type CovidEntry record {|
    readonly string iso_code;
    string country;
    decimal cases;
    decimal deaths;
    decimal recovered;
    decimal active;
|};

public final table<CovidEntry> key(iso_code) covidTable = table [
    {iso_code: "AFG", country: "Afghanistan", cases: 159303, deaths: 7386, recovered: 146084, active: 5833},
    {iso_code: "SL", country: "Sri Lanka", cases: 598536, deaths: 5243, recovered: 568637, active: 14656},
    {iso_code: "US", country: "USA", cases: 69808350, deaths: 880976, recovered: 43892277, active: 25035097},
    {iso_code: "IND", country: "India", cases: 80808350, deaths: 980976, recovered: 33892279, active: 35035095}
];

In this code:

  • Each record of type CovidEntry in the table represents the COVID-19 data related to a particular country.
  • The iso_code is used to uniquely identify a country and other fields are self-explanatory.

Filter the data

Create the filterCountriesByCases function

To define a function, which will filter out the records, which have values higher than 100,000 for the cases field, add the code below to the main.bal file.

public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
    string[] filteredCountries = from CovidEntry entry in dataTable
        where entry.cases > noOfCases
        select entry.country;
    return filteredCountries;
}

In this code:

  • The filterCountriesByCases function uses a query expression to iterate the records in the dataTable table and filter only the records, which have more than noOfCases cases. A where clause with a condition is used to filter.
  • All records, which satisfy the condition in the where clause will be selected.

Create the main function

To call the filterCountriesByCases function from inside the main function, add the code below to the main.bal file.

public function main() {
    string[] countries = filterCountriesByCases(covidTable, 10000000);
    io:println("Countries with more than 10 million cases: ", countries);
}

In this code,

  • The filterCountriesByCases function is called and covidTable and 10000000 are provided as parameters so that the function will filter the countries, which have more than 10000000 COVID-19 cases.
  • The next line prints the result of the function.

The complete code of the filtering function

Below is the complete code after adding the filtering function.

public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
    string[] filteredCountries = from CovidEntry entry in dataTable
        where entry.cases > noOfCases
        select entry.country;
    return filteredCountries;
}

public function main() {
    string[] countries = filterCountriesByCases(covidTable, 10000000);
    io:println("Countries with more than 10 million cases: ", countries);
}

Run the package for filtering

In the terminal, navigate to the query_expressions directory, and execute the command below to run the service package.

$ bal run

Info: The console should have warning logs related to the isolatedness of resources. It is a built-in service concurrency safety feature of Ballerina.

You view the output below.

Compiling source
        .../query_expressions:0.1.0

Running executable

Countries with more than 10 million cases: ["USA","India"]

Sort countries by COVID-19 deaths

Create the findCountriesByHighestNoOfDeaths function

To define a new function to find the top three countries with the highest number of COVID-19 deaths, add the code below to the main.bal file.

public function findCountriesByHighestNoOfDeaths(table<CovidEntry> dataTable, int n) returns [string, decimal][] {
    [string, decimal][] countriesWithDeaths = from CovidEntry entry in dataTable
        order by entry.deaths descending
        limit n
        select [entry.country, entry.deaths];
    return countriesWithDeaths;
}

In this code,

  • You use another query to sort and retrieve a limited number of records from the table.
  • The findCountriesByHighestNoOfDeaths function uses a query expression to find the top three countries with the highest COVID-19 deaths.
  • The order by clause is used to sort the records in the table in descending order and the limit clause to limit the number of output records of the query to n.
  • The query produces an array of tuples of type [string, decimal] as the result. Each tuple contains the country name and the number of reported deaths.
  • The produced array is in descending order by the number of deaths.

Update the main function for sorting

To call the findCountriesByHighestNoOfDeaths function from within the main function to find the top three countries by the number of deaths, add the code below to the main function of the main.bal file.

[string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
io:println("Countries with highest deaths:", countriesWithDeaths);

The complete code with the sorting function

Below is the complete code after adding the sorting function.

public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
    string[] filteredCountries = from CovidEntry entry in dataTable
        where entry.cases > noOfCases
        select entry.country;
    return filteredCountries;
}

public function findCountriesByHighestNoOfDeaths(table<CovidEntry> dataTable, int n) returns [string, decimal][] {
    [string, decimal][] countriesWithDeaths = from CovidEntry entry in dataTable
        order by entry.deaths descending
        limit n
        select [entry.country, entry.deaths];
    return countriesWithDeaths;
}

public function main() {
    string[] countries = filterCountriesByCases(covidTable, 10000000);
    io:println("Countries with more than 10 million cases: ", countries);
    [string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
    io:println("Countries with highest deaths:", countriesWithDeaths);
}

Run the package for sorting

In the terminal, navigate to the query_expressions directory, and execute the command below to run the service package.

$ bal run

You view the output below.

Compiling source
        .../query_expressions:0.1.0

Running executable

Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]

Join with another data source

Create the findRecoveredPatientsOfCountries function

Using query expressions, you can join two collections and produce a new collection. The join operation is similar to the SQL join operation. To join the covidTable with a string array, which contains three countries, add the code below to the main.bal file.

public function findRecoveredPatientsOfCountries(table<CovidEntry> dataTable, string[] countries) returns [string, decimal][] {
    [string, decimal][] countriesWithRecovered = from CovidEntry entry in dataTable
        join string country in countries on entry.country equals country
        select [entry.country, entry.recovered];
    return countriesWithRecovered;
}

In this code:

  • The findRecoveredPatientsOfCountries function uses a query expression to join the dataTable table with an array of strings named countries. The join condition is provided after the on keyword.
  • For every record in the dataTable, all the elements in the countries array will be joined.
  • The output array of tuples will have the country and the number of recovered patients only if the condition after the on keyword is satisfied for that particular pair of table records and element of the array being joined.

Update the main function for joining

To call the findRecoveredPatientsOfCountries function at the end to get the number of recovered patients, add the code below to the main function of the main.bal file.

Info: You will get the number of recovered patients in the USA, India, and Afghanistan.

string[] c = ["USA", "India", "Afghanistan"];
[string, decimal][] countriesWithRecovered = findRecoveredPatientsOfCountries(covidTable, c);
io:println("Countries with number of Recovered patients:", countriesWithRecovered);

The complete code with the joining function

Below is the complete code after adding the joining function.

public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
    string[] filteredCountries = from CovidEntry entry in dataTable
        where entry.cases > noOfCases
        select entry.country;
    return filteredCountries;
}

public function findCountriesByHighestNoOfDeaths(table<CovidEntry> dataTable, int n) returns [string, decimal][] {
    [string, decimal][] countriesWithDeaths = from CovidEntry entry in dataTable
        order by entry.deaths descending
        limit n
        select [entry.country, entry.deaths];
    return countriesWithDeaths;
}

public function findRecoveredPatientsOfCountries(table<CovidEntry> dataTable, string[] countries) returns [string, decimal][] {
    [string, decimal][] countriesWithRecovered = from CovidEntry entry in dataTable
        join string country in countries on entry.country equals country
        select [entry.country, entry.recovered];
    return countriesWithRecovered;
}

public function main() {
    string[] countries = filterCountriesByCases(covidTable, 10000000);
    io:println("Countries with more than 10 million cases: ", countries);

    [string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
    io:println("Countries with highest deaths:", countriesWithDeaths);

    string[] c = ["USA", "India", "Afghanistan"];
    [string, decimal][] countriesWithRecovered = findRecoveredPatientsOfCountries(covidTable, c);
    io:println("Countries with number of Recovered patients:", countriesWithRecovered);
}

Run the package for joining

In the terminal, navigate to the query_expressions directory, and execute the command below to run the service package.

$ bal run

You view the output below.

Compiling source
        .../query_expressions:0.1.0

Running executable

Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]
Countries with number of Recovered patients:[["Afghanistan",146084],["USA",43892277],["India",33892279]]

Find discrepancies in the dataset

Create the printErroneousData function

This example shows how you can use the let clause to maintain an intermediate state while iterating a collection using query expression and how to use that intermediate state for further processing. For example, in this dataset, the total number of reported cases should be equal to the sum of the number of deaths, recovered, and active. If they are not equal, an error should have occurred while the dataset is populated.

To define a function called printErroneousData to find any erroneous records in the dataset, add the code below to the main.bal file.

public function printErroneousData(table<CovidEntry> dataTable) {

    string[] countries = from CovidEntry entry in dataTable
        let decimal sum = entry.recovered + entry.deaths + entry.active
        where entry.cases != sum
        select entry.country;

    if countries.length() > 0 {
        io:println("Found erroneous entries for countries: ", countries);
    }
}

In this code:

  • If there is any record in which the number of reported cases is not equal to the sum of recovered, active, and deaths, this function will print it.
  • Here, you use the sum variable to hold the results of intermediate calculations in the query expression.

Note: Even though this particular example uses a separate variable to demonstrate the usage of the let clause, you can do the calculations inline from within the where clause also.

Call the printErroneousData function

To call the findRecoveredPatientsOfCountries function at the end to get the number of recovered patients, add the code below to the main function of the main.bal file.

printErroneousData(covidTable);

The complete code

Below is the complete code after adding the finding function.

public function filterCountriesByCases(table<CovidEntry> dataTable, decimal noOfCases) returns string[] {
    string[] filteredCountries = from CovidEntry entry in dataTable
        where entry.cases > noOfCases
        select entry.country;
    return filteredCountries;
}

public function findCountriesByHighestNoOfDeaths(table<CovidEntry> dataTable, int n) returns [string, decimal][] {
    [string, decimal][] countriesWithDeaths = from CovidEntry entry in dataTable
        order by entry.deaths descending
        limit n
        select [entry.country, entry.deaths];
    return countriesWithDeaths;
}

public function findRecoveredPatientsOfCountries(table<CovidEntry> dataTable, string[] countries) returns [string, decimal][] {
    [string, decimal][] countriesWithRecovered = from CovidEntry entry in dataTable
        join string country in countries on entry.country equals country
        select [entry.country, entry.recovered];
    return countriesWithRecovered;
}

public function printErroneousData(table<CovidEntry> dataTable) {

    string[] countries = from CovidEntry entry in dataTable
        let decimal sum = entry.recovered + entry.deaths + entry.active
        where entry.cases != sum
        select entry.country;

    if countries.length() > 0 {
        io:println("Found erroneous entries for countries: ", countries);
    }
}

public function main() {
    string[] countries = filterCountriesByCases(covidTable, 10000000);
    io:println("Countries with more than 10 million cases: ", countries);

    [string, decimal][] countriesWithDeaths = findCountriesByHighestNoOfDeaths(covidTable, 3);
    io:println("Countries with highest deaths:", countriesWithDeaths);

    string[] c = ["USA", "India", "Afghanistan"];
    [string, decimal][] countriesWithRecovered = findRecoveredPatientsOfCountries(covidTable, c);
    io:println("Countries with number of Recovered patients:", countriesWithRecovered);

    printErroneousData(covidTable);
}

Run the package

In the terminal, navigate to the query_expressions directory, and execute the command below to run the service package.

$ bal run

You view the output below.

Compiling source
        .../query_expressions:0.1.0

Running executable

Countries with more than 10 million cases: ["USA","India"]
Countries with highest deaths:[["India",980976],["USA",880976],["Afghanistan",7386]]
Countries with number of Recovered patients:[["Afghanistan",146084],["USA",43892277],["India",33892279]]
Found erroneous entries for countries: ["Sri Lanka","India"]