SQL String Contains in Postgres, MySQL, SQLite, BigQuery
LearnDataSci is reader-supported. When you purchase through links on our site, earned commissions help support our team of writers, researchers, and designers at no extra cost to you.
For this article, we used a table containing Netflix titles. You can find it here on Kaggle.
For your reference, here's what the first five rows of the data looks like:
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
A Quick and Easy Way
An easy way to check with SQL if a string contains a particular phrase is by using the LIKE
clause in conjunction with the '%'
wildcard, which acts as a way to match characters around the target word. For example, we can do this as follows:
title | |
---|---|
0 | Madness in the Desert |
1 | Six Windows in the Desert |
2 | Who Would You Take to a Deserted Island? |
3 | Jalpari: The Desert Mermaid |
4 | Queen of the Desert |
5 | The Desert Bride |
Notice that the '%'
wildcard matched "desert" in many positions. Other than LIKE
, this article covers various SQL options developers can use to check if a string contains a substring.
A few simple use cases for wanting to find a substring include:
- Identifying different variants of a product e.g., blue T-shirt small, blue T-shirt medium
- Searching user messages/posts for a particular phrase or word
- Finding and replacing substrings when cleaning data
We'll start by covering the LIKE
operator in more detail, then look at Regular Expressions (Regex) and the INSTR
function. Our examples will target the PostgreSQL dialect, but we've included a list of SQL dialects before each feature to show if it's supported and if the approach needs to be varied.
Option 1: The LIKE
clause
Main SQL dialects supported
SQL Dialect | Variation |
---|---|
PostgreSQL | WHERE value LIKE pattern |
MYSQL | WHERE value LIKE pattern |
SQLite | WHERE value LIKE pattern |
BigQuery | WHERE value LIKE pattern |
T-SQL | WHERE value LIKE pattern |
The LIKE
clause compares two string values. It's usually combined with the '%'
and '_'
wildcards to search for substrings. We'll return to the '_'
wildcard later, but '%'
can be used loosely in the following manners:
Wildcard usage | Outcome |
---|---|
'%substring%' | Match values containing 'substring' |
'substring%' | Match values that start with 'substring' |
'%substring' | Match values that end with 'substring' |
Match values containing a substring
title | |
---|---|
0 | Mama Drama |
1 | Abla Fahita: Drama Queen |
2 | Total Drama |
3 | Dramaworld |
Match values that start with a substring
title | |
---|---|
0 | Dramaworld |
Match values that end with a substring
title | |
---|---|
0 | Mama Drama |
1 | Total Drama |
Adding in the '_'
operator
It's less common, but you can leverage the '_'
wildcard to enhance your LIKE
usage. '_'
represents a single character, which allows you to make your LIKE
comparisons more specific. For example, you can use '_'
in the following fashions:
Wildcard usage | Outcome |
---|---|
'_substring%' | Match values where 'substring' starts from the second character |
'substring_%' | Match values that start with 'substring' and have at least one additional character |
'substring__%' | Match values that start with 'substring' and have at least two additional characters |
As mentioned, this is rare, but it does allow you to make more bespoke substring searches. Following the above examples, see below for a demonstration of how we could add in the '_'
wildcard to find all values that contain a substring but don't end with it:
title | |
---|---|
0 | Abla Fahita: Drama Queen |
1 | Dramaworld |
Remember that you don't need to restrict the usage of '%'
and '_'
to what's shown here. These wildcards can be stringed together in many ways to help pinpoint much more specific patterns.
Option 2: Regex matching
Main SQL dialects supported
SQL Dialect | Variation |
---|---|
PostgreSQL | REGEXP_MATCH(column, pattern) |
MYSQL | column REGEXP pattern |
SQLite | column REGEXP pattern |
BigQuery | REGEXP_CONTAINS(column, pattern) |
While slightly more complicated, regex is often the go-to approach for substring searching. Regex is used to search for patterns ('%substring%'
could be defined as a simple pattern), but it allows for heavy customization - this can be ideal when working with messy text data.
While regex can be complex, using it to search for substrings is relatively simple:
title | |
---|---|
0 | Batman: The Killing Joke |
1 | Lego DC Comics: Batman Be-Leaguered |
Regex 101 is a great place to experiment with regex. You can apply your regex patterns to test strings to determine what's getting matched. See below for an example of how Regex 101 was used to test the usage of the \\D
character class, which corresponds to anything that isn't a digit.
Option 3: The INSTR
function
Main SQL dialects supported
SQL Dialect | Variation |
---|---|
MySQL | INSTR(column, pattern) |
SQLite | INSTR(column, pattern) |
BigQuery | CONTAINS_SUBSTR(column, pattern) |
title | |
---|---|
0 | Fresh, Fried & Crispy |
1 | Judah Friedlander: America Is the Greatest Cou... |
Performance comparison
Using Python combined with duckdb
, we can run some speed tests on the approaches described above to compare their performances.
type | milliseconds | |
---|---|---|
1 | Regex | 0.000015 |
2 | INSTR | 0.000015 |
0 | LIKE clause | 5.511349 |
These results show that regex is the fastest, outpacing the LIKE
clause by a reasonable amount. INSTR is also very quick, suggesting it could be utilizing a regex search under the hood.
Remember that the data table used in this article is relatively small-scale; performance could vary based on table size and the type of database used.
Summary
In SQL, the simplest way to see if a column contains a substring is by using the LIKE
clause in combination with wildcards. Alternatively, most SQL dialects include functionality for regex matching.
While more complex, we'd recommend using Regex, which offers more customization.
Regex skills can be massively helpful when working with text data - it does much more than just substring searching, but this is a good starting point if you want to start leveraging regex. As a bonus, it usually comes with a performance boost!