This chapter covers the match/2, match/3, and match/4 predicates for regular expression operations in the AWK target.
The AWK target provides regex pattern matching through dedicated predicates:
| Predicate | Purpose |
|---|---|
| match(String, Pattern) | Boolean match test |
| match(String, Pattern, Captures) | Match with capture groups |
| match(String, Pattern, Captures, Type) | Match with capture and regex type |
The simplest form tests whether a string matches a pattern.
% Match email pattern
valid_email(Email) :-
user(_, Email),
match(Email, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}").
% Match phone number
valid_phone(Phone) :-
contact(_, Phone),
match(Phone, "^[0-9]{3}-[0-9]{3}-[0-9]{4}$").
Generated AWK for valid_email/1:
{
Email = $2
if (Email ~ /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/) {
print Email
}
}
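To see how the generated condition behaves, the shell sketch below runs the same `~` test by hand. The sample rows and the tab-separated layout (name in field 1, email in field 2) are assumptions made for illustration, and `[a-zA-Z][a-zA-Z]+` stands in for `{2,}` because some older awk builds lack interval expressions.

```shell
# Hypothetical sample data: name <TAB> email
valid_emails() {
  printf 'Alice\talice@example.com\nCarol\tinvalid-email\nDave\tdave@test.io\n' |
  awk -F'\t' '{
    Email = $2
    # Same test as the generated script; [a-zA-Z][a-zA-Z]+ replaces {2,}
    if (Email ~ /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z][a-zA-Z]+/) {
      print Email
    }
  }'
}

valid_emails
```

Only rows whose second field matches are printed, so Carol's row is dropped.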
% Starts with
starts_with_a(Name) :-
person(Name),
match(Name, "^A").
% Ends with
ends_with_son(Name) :-
person(Name),
match(Name, "son$").
% Contains
contains_error(Line) :-
log_line(Line),
match(Line, "ERROR").
% Word boundary (approximate in AWK)
word_match(Text) :-
document(Text),
match(Text, "(^|[^a-zA-Z])test([^a-zA-Z]|$)").
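The word-boundary pattern above is only an approximation, so it helps to try it on a few inputs. The following is a minimal sketch; the helper name `word_match` and the sample strings are made up for illustration.

```shell
# Hypothetical helper: report whether "test" appears as a standalone word
word_match() {
  printf '%s\n' "$1" | awk '{
    # A non-letter (or line start/end) must sit on both sides of "test"
    if ($0 ~ /(^|[^a-zA-Z])test([^a-zA-Z]|$)/) print "match"; else print "no match"
  }'
}

word_match "run the test now"
word_match "contest results"
word_match "testing"
```

The first call matches; the other two do not, because a letter is adjacent to "test". Digits and underscores still count as boundaries here, which is one reason the pattern is described as approximate.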
Extract matched substrings using capture groups.
% Extract username from email
extract_user(Email, Username) :-
contact(_, Email),
match(Email, "([^@]+)@", [Username]).
Generated AWK for extract_user/2 (the three-argument match() with a captures array is a gawk extension):
{
Email = $2
if (match(Email, /([^@]+)@/, captures)) {
Username = captures[1]
print Username
}
}
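Where gawk is not available, a single capture like this one can be recovered by hand from `RSTART` and `RLENGTH`, which the two-argument `match()` sets in any POSIX awk. This is a hand-written sketch, not output from the AWK target, and the helper name `extract_user` is reused here purely for illustration.

```shell
# Hypothetical portable fallback for the single-capture case
extract_user() {
  printf '%s\n' "$1" | awk '{
    Email = $0
    # match() sets RSTART and RLENGTH; the matched text includes the trailing "@"
    if (match(Email, /[^@]+@/)) {
      Username = substr(Email, RSTART, RLENGTH - 1)
      print Username
    }
  }'
}

extract_user "alice@example.com"
```

This approach works for one capture at the edge of the match; patterns with several groups are much easier with the gawk captures array.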
% Parse date: YYYY-MM-DD
parse_date(DateStr, Year, Month, Day) :-
record(_, DateStr),
match(DateStr, "([0-9]{4})-([0-9]{2})-([0-9]{2})", [Year, Month, Day]).
% Parse log entry: [LEVEL] message
parse_log(Line, Level, Message) :-
log(Line),
match(Line, "\\[([A-Z]+)\\] (.*)", [Level, Message]).
% Parse version: major.minor[.patch]
parse_version(Ver, Major, Minor, Patch) :-
software(_, Ver),
match(Ver, "([0-9]+)\\.([0-9]+)(\\.([0-9]+))?", [Major, Minor, _, Patch]).
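The `_` in the parse_version capture list exists because groups are numbered by their opening parenthesis, so the optional `(\.([0-9]+))?` wrapper is group 3 and the patch digits are group 4. The sketch below makes the numbering visible with `sed -E`, used here only because it exposes ERE group numbers portably; it is not part of the AWK target, and the helper name `show_groups` is hypothetical.

```shell
# Hypothetical helper: print each capture group of the version pattern
show_groups() {
  printf '%s\n' "$1" |
  sed -E 's/^([0-9]+)\.([0-9]+)(\.([0-9]+))?$/major=\1 minor=\2 group3=\3 patch=\4/'
}

show_groups "1.2.3"
show_groups "1.2"
```

For `1.2.3`, group 3 holds `.3` and group 4 holds `3`; for `1.2`, both are empty, so Patch would come back as an empty string.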
match/4 takes a fourth argument that specifies the regex dialect explicitly.
| Type | Description |
|---|---|
| auto | AWK’s native regex (default) |
| ere | Extended Regular Expressions |
| bre | Basic Regular Expressions |
| awk | AWK-specific regex |
% Explicit ERE
match_ere(Text, Pattern, Captures) :-
input(Text),
match(Text, Pattern, Captures, ere).
% Explicit BRE (requires different escaping)
match_bre(Text, Pattern, Captures) :-
input(Text),
match(Text, Pattern, Captures, bre).
% ERE: Standard extended regex (most common)
email_ere(Email, User, Domain) :-
record(Email),
match(Email, "([^@]+)@(.+)", [User, Domain], ere).
% AWK: AWK's native (similar to ERE)
email_awk(Email, User, Domain) :-
record(Email),
match(Email, "([^@]+)@(.+)", [User, Domain], awk).
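The note that BRE "requires different escaping" comes down to which characters are operators. How the AWK target handles the `bre` type is not shown in this chapter, so the sketch below only illustrates the dialect difference itself, using `grep` (BRE by default) and `grep -E` (ERE). The helper names and sample lines are made up.

```shell
# Two sample lines: one with repeated "a", one with a literal plus sign
sample() { printf 'aab\na+b\n'; }

bre_match() { sample | grep -e "$1"; }      # grep defaults to BRE
ere_match() { sample | grep -E -e "$1"; }   # -E selects ERE

bre_match 'a+b'   # BRE: "+" is an ordinary character, so only "a+b" matches
ere_match 'a+b'   # ERE: "+" means one or more, so only "aab" matches
```

The same pattern text selects different lines in each dialect, which is why a pattern written for `ere` usually needs rewriting before it is used with `bre`.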
In Prolog strings, each backslash intended for the regex must be doubled:
% Match literal dot
dot_match(Text) :-
input(Text),
match(Text, "\\."). % Matches literal "."
% Match digits
digit_match(Text) :-
input(Text),
match(Text, "[0-9]+"). % Character class
% Match whitespace
space_match(Text) :-
input(Text),
match(Text, "[ \\t]+"). % Space or tab
| Prolog | Regex | Matches |
|---|---|---|
| `\\.` | `\.` | Literal dot |
| `\\\\` | `\\` | Literal backslash |
| `\\[` | `\[` | Literal bracket |
| `\\(` | `\(` | Literal parenthesis |
| `\\t` | `\t` | Tab character |
| `\\n` | `\n` | Newline character |
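Once Prolog has read `"\\."`, the regex that reaches AWK is `\.`. The sketch below shows why the escape matters by comparing an escaped and an unescaped dot on the AWK side; the helper name `dot_test` and the inputs are assumptions for illustration.

```shell
# Hypothetical helper: compare an escaped and an unescaped dot
dot_test() {
  printf '%s\n' "$1" | awk '{
    literal = ($0 ~ /\./) ? "yes" : "no"   # \. matches only a real "."
    any     = ($0 ~ /./)  ? "yes" : "no"   # .  matches any character
    print "literal_dot=" literal " any_char=" any
  }'
}

dot_test "a.b"
dot_test "ab"
```

For `ab` the unescaped dot still matches, so forgetting to double the backslash in the Prolog string tends to produce patterns that accept too much rather than patterns that fail.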
% Parse Apache log format
apache_log(Line, IP, Method, Path, Status) :-
log_line(Line),
match(Line,
"([0-9.]+) .* \"([A-Z]+) ([^ ]+) .*\" ([0-9]+)",
[IP, Method, Path, Status]).
% Extract error messages
error_message(Line, ErrorType, Message) :-
log_line(Line),
match(Line, "ERROR: ([A-Za-z]+): (.*)", [ErrorType, Message]).
% Validate IP address (simplified)
valid_ip(IP) :-
network(IP),
match(IP, "^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$").
% Validate UUID
valid_uuid(UUID) :-
record(UUID),
match(UUID, "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$").
% Validate credit card (simplified)
valid_card(Card) :-
payment(Card),
match(Card, "^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$").
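The IP pattern above is marked as simplified because it checks only the shape: `999.999.999.999` passes. If stricter validation is needed, one option is a range check on each octet after the regex test. The following is a hand-written AWK sketch of that idea, not something the AWK target is known to generate, and `valid_ip` is a hypothetical helper name.

```shell
# Hypothetical helper: shape check plus a 0-255 range check per octet
valid_ip() {
  printf '%s\n' "$1" | awk '{
    # Shape: four dot-separated runs of digits
    ok = ($0 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
    n = split($0, octet, ".")
    for (i = 1; i <= n && ok; i++)
      if (length(octet[i]) > 3 || octet[i] + 0 > 255) ok = 0
    print (ok ? "valid" : "invalid")
  }'
}

valid_ip "192.168.1.1"
valid_ip "999.1.1.1"
```

The first call reports `valid` and the second `invalid`.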
% Extract URL components
parse_url(URL, Protocol, Host, Path) :-
link(URL),
match(URL, "^([a-z]+)://([^/]+)(/.*)?$", [Protocol, Host, Path]).
% Extract domain from URL
domain(URL, Domain) :-
link(URL),
match(URL, "://([^/:]+)", [Domain]).
% Extract quoted strings
quoted_text(Line, Text) :-
document(Line),
match(Line, "\"([^\"]*)\"", [Text]).
% Extract bracketed content
bracketed(Line, Content) :-
document(Line),
match(Line, "\\[([^\\]]*)\\]", [Content]).
% Extract key=value pairs
key_value(Line, Key, Value) :-
config(Line),
match(Line, "^([^=]+)=(.*)$", [Key, Value]).
To select lines that do not match a pattern, negate the call with \+:
% Lines without errors
clean_lines(Line) :-
log(Line),
\+ match(Line, "ERROR|WARN|FATAL").
Generated AWK:
{
Line = $0
if (Line !~ /ERROR|WARN|FATAL/) {
print Line
}
}
% Valid emails from active users
active_valid_email(Email) :-
user(Name, Email, Status),
Status = "active",
match(Email, "[a-z]+@[a-z]+\\.[a-z]+").
% Count error types
error_count(ErrorType, N) :-
log(Line),
match(Line, "ERROR: ([A-Za-z]+):", [ErrorType]),
N = 1.
?- compile_predicate_to_awk(error_count/2, [aggregation(count)], AWK).
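The script produced by `aggregation(count)` is not shown in this chapter. As a rough idea of the counting idiom involved, here is a hand-written, portable AWK sketch that tallies lines per error type. The sample log lines, the helper name `error_counts`, and the output format are all assumptions.

```shell
# Hypothetical helper: count log lines per error type (reads stdin)
error_counts() {
  awk '
    match($0, /ERROR: [A-Za-z]+:/) {
      # Strip the leading "ERROR: " (7 chars) and the trailing ":"
      etype = substr($0, RSTART + 7, RLENGTH - 8)
      count[etype]++
    }
    END { for (t in count) print t, count[t] }
  ' | sort   # for-in order is unspecified, so sort for stable output
}

printf 'ERROR: Timeout: db\nERROR: Auth: bad token\nINFO: ok\nERROR: Timeout: cache\n' |
  error_counts
```

With the sample input this prints `Auth 1` and `Timeout 2`.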
% file: regex_matching.pl
:- encoding(utf8).
:- use_module('src/unifyweaver/targets/awk_target').
% Boolean match - valid emails
valid_email(Email) :-
contact(_, Email, _),
match(Email, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}").
% Capture - parse email
email_parts(Email, User, Domain) :-
contact(_, Email, _),
match(Email, "([^@]+)@(.*)", [User, Domain]).
% Parse date
date_parts(DateStr, Year, Month, Day) :-
record(_, DateStr, _),
match(DateStr, "([0-9]{4})-([0-9]{2})-([0-9]{2})", [Year, Month, Day]).
% Parse log entry
log_entry(Line, Level, Timestamp, Message) :-
log_line(Line),
match(Line, "\\[([A-Z]+)\\] ([0-9:T-]+) (.*)", [Level, Timestamp, Message]).
% Filter errors only
errors_only(Line, Message) :-
log_line(Line),
match(Line, "\\[ERROR\\] [^ ]+ (.*)", [Message]).
% Generate scripts
generate :-
compile_predicate_to_awk(valid_email/1, [], AWK1),
write_awk_script('valid_email.awk', AWK1),
compile_predicate_to_awk(email_parts/3, [], AWK2),
write_awk_script('email_parts.awk', AWK2),
compile_predicate_to_awk(date_parts/4, [], AWK3),
write_awk_script('date_parts.awk', AWK3),
compile_predicate_to_awk(log_entry/4, [], AWK4),
write_awk_script('log_entry.awk', AWK4),
compile_predicate_to_awk(errors_only/2, [], AWK5),
write_awk_script('errors_only.awk', AWK5),
format('Generated regex AWK scripts~n').
:- initialization(generate, main).
Test data (contacts.tsv):
Alice alice@example.com active
Bob bob.smith@company.org active
Carol invalid-email inactive
Dave dave@test.io active
Test data (logs.txt):
[INFO] 2025-01-15T10:30:00 Server started
[ERROR] 2025-01-15T10:31:00 Connection failed
[WARN] 2025-01-15T10:32:00 High memory usage
[ERROR] 2025-01-15T10:33:00 Database timeout
Run:
swipl regex_matching.pl
echo "Filtering valid emails:"
awk -f valid_email.awk contacts.tsv
echo "Parsing email parts:"
awk -f email_parts.awk contacts.tsv
echo "Extracting errors:"
awk -f errors_only.awk logs.txt
In this chapter, you learned:
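- match/2 for boolean match tests
- match/3 for matching with capture groups
- match/4 for matching with captures and an explicit regex type

In Chapter 7, we’ll bring everything together with practical applications: log analysis, data transformation, and pipeline integration.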