Python Regular Expressions (RegEx)
Introduction to Regular Expressions
Regular Expressions (RegEx) are sequences of characters that define a search pattern. They are widely used in text processing tasks such as searching, replacing, and extracting specific patterns from strings.
In Python, the `re` module provides built-in support for working with Regular Expressions.
Why Use RegEx?
- Pattern Matching: Check if a string follows a specific format.
- Text Searching: Locate specific words or patterns in a document.
- Data Validation: Ensure input follows a defined structure (e.g., email format).
- Text Extraction: Extract useful information from structured/unstructured text.
Python's `re` Module
- Python provides the `re` module, which contains functions for handling Regular Expressions.
import re
import re
# Sample dataset labels
labels = [
"AI_Vision_Model",
"AI_NLP_Model",
"AI_Generative_Model",
"ML_Regression_Model",
"AI_Chatbot_System"
]
# Regular expression pattern to match labels
# ^AI_ -> Ensures the label starts with "AI_"
# .* -> Matches any characters in between
# _Model$ -> Ensures the label ends with "_Model"
# re.search(pattern, label) -> Checks if each label matches the pattern.
pattern = r"^AI_.*_Model$"
# Checking each label against the pattern
for label in labels:
    if re.search(pattern, label):
        print(f"Match: {label}")
    else:
        print(f"No Match: {label}")
Match: AI_Vision_Model
Match: AI_NLP_Model
Match: AI_Generative_Model
No Match: ML_Regression_Model
No Match: AI_Chatbot_System
RegEx Functions
Function | Description |
---|---|
findall() | Returns a list of all matches in a string. |
search() | Returns a Match object for the first match found. |
split() | Splits a string at each match and returns a list. |
sub() | Replaces occurrences of a pattern with a specified string. |
1. findall() – Extracting All Matches
- The `findall()` function returns all non-overlapping occurrences of a pattern in the given text as a list. If no match is found, it returns an empty list.
import re
# Sample AI-related text
text = "Machine Learning, Deep Learning, and AI are evolving rapidly."
# Regular expression pattern to find occurrences of "Learning"
pattern = r"\bLearning\b" # \b ensures exact word match
# Extract all occurrences of the pattern
matches = re.findall(pattern, text)
print("Matches found:", matches)
### Output: Matches found: ['Learning', 'Learning']
Matches found: ['Learning', 'Learning']
Handling No Matches
import re
text = "Neural networks power modern AI models."
# Trying to find the word "Learning"
pattern = r"\bLearning\b"
matches = re.findall(pattern, text)
# Checking if any matches were found
if matches:
    print("Matches found:", matches)
else:
    print("No matches found.")
### Output: No matches found.
No matches found.
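One detail the description above does not spell out: when the pattern contains capturing groups, `findall()` returns the captured groups instead of the full matched text. A minimal sketch, using a made-up sample string that is not part of the original examples:

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "accuracy=98, recall=87, f1=92"

# No groups: each list item is the full matched text
whole = re.findall(r"\w+=\d+", text)

# Two groups: each list item is a tuple of the captured parts
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)  # ['accuracy=98', 'recall=87', 'f1=92']
print(pairs)  # [('accuracy', '98'), ('recall', '87'), ('f1', '92')]
```

If the full match is wanted while still grouping part of the pattern, a non-capturing group `(?:...)` keeps the first behavior.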
2. search() – Finding the First Match
- The `search()` function returns a Match object for only the first occurrence of a pattern in the text. If no match is found, it returns `None`.
import re
# Sample dataset text
text = "AI models have 98% accuracy, and 87% of users prefer AI-based tools."
# Regular expression pattern to find the first number
pattern = r"\d+"
# Searching for the first number in the text
match = re.search(pattern, text)
if match:
    print("First number found:", match.group())  # Extract the matched number
else:
    print("No number found in the text.")
### Output: First number found: 98
First number found: 98
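The Match object returned by `search()` can also expose only part of the match. The sketch below reuses the text from the example above and assumes we want the percentage value without the `%` sign; the capturing group is an addition for illustration.

```python
import re

text = "AI models have 98% accuracy, and 87% of users prefer AI-based tools."

# Capture the digits in a group so they can be read without the '%' sign
match = re.search(r"(\d+)%", text)

if match:
    print(match.group())   # full match, including '%': 98%
    print(match.group(1))  # only the captured digits: 98
```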
3. split() – Splitting Text Based on a Pattern
- The `split()` function divides a string at each match of the pattern and returns a list of substrings.
import re
# Sample text
text = "AI is powerful. It transforms industries! Can we keep up?"
# Regular expression pattern to split sentences using punctuation
pattern = r"[.!?]"
# Splitting the text
sentences = re.split(pattern, text)
# Removing empty strings from the list
sentences = [s.strip() for s in sentences if s.strip()]
print("Split sentences:", sentences)
### Output: Split sentences: ['AI is powerful', 'It transforms industries', 'Can we keep up']
Split sentences: ['AI is powerful', 'It transforms industries', 'Can we keep up']
Limiting the Number of Splits
- You can control how many times the split happens using the `maxsplit` parameter.
text = "AI is evolving fast. New models emerge daily."
# Split only at the first punctuation mark
sentences = re.split(r"[.!?]", text, maxsplit=1)
print("Limited split:", sentences)
### Output: Limited split: ['AI is evolving fast', ' New models emerge daily.']
Limited split: ['AI is evolving fast', ' New models emerge daily.']
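Both split examples discard the punctuation they split on. If the delimiters need to be kept, one option is to wrap the pattern in a capturing group, which makes `re.split()` include each delimiter in the result. A small sketch; the re-joining step is just one possible way to rebuild the sentences:

```python
import re

text = "AI is powerful. It transforms industries! Can we keep up?"

# Wrapping the pattern in parentheses keeps each delimiter in the result
parts = re.split(r"([.!?])", text)
print(parts)
# ['AI is powerful', '.', ' It transforms industries', '!', ' Can we keep up', '?', '']

# Re-attach each delimiter to the sentence that precedes it
sentences = [(parts[i] + parts[i + 1]).strip() for i in range(0, len(parts) - 1, 2)]
print(sentences)
# ['AI is powerful.', 'It transforms industries!', 'Can we keep up?']
```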
4. sub() – Replacing Text Using a Pattern
- The `sub()` function replaces occurrences of a pattern with a specified string.
import re
# Sample AI-related text
text = "Deep Learning and AI models are advancing."
# Regular expression pattern to replace "AI" and "Deep Learning" with "ML"
pattern = r"(AI|Deep Learning)"
# Replacing occurrences with "ML"
updated_text = re.sub(pattern, "ML", text)
print("Updated text:", updated_text)
### Output: Updated text: ML and ML models are advancing.
Updated text: ML and ML models are advancing.
Limiting the Number of Replacements
- You can specify a `count` parameter to limit the number of replacements.
text = "AI is evolving. AI is everywhere."
# Replace only the first occurrence of "AI"
updated_text = re.sub(r"AI", "ML", text, count=1)
print("Updated text:", updated_text)
### Output: Updated text: ML is evolving. AI is everywhere.
Updated text: ML is evolving. AI is everywhere.
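The replacement passed to `sub()` does not have to be a fixed string. As a sketch of two other options, the replacement can refer back to a captured group with `\1`, or it can be a function that receives each Match object. The helper name `to_upper` is hypothetical.

```python
import re

text = "Deep Learning and AI models are advancing."

# \1 in the replacement refers to the text captured by the first group
tagged = re.sub(r"(AI|Deep Learning)", r"[\1]", text)
print(tagged)   # [Deep Learning] and [AI] models are advancing.

# A function receives each Match object and returns the replacement text
def to_upper(match):
    return match.group().upper()

shouted = re.sub(r"AI|Deep Learning", to_upper, text)
print(shouted)  # DEEP LEARNING and AI models are advancing.
```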
RegEx Match Object
A Match object stores details about a successful match when using RegEx functions like `search()`. It contains useful properties and methods to retrieve information such as:
- Match position (start and end index)
- Original search string
- Extracted matched text
Note: If no match is found, `None` is returned instead of a Match object.
Why Use Match Objects?
- Extract Key Information – Get specific parts of a matched pattern.
- Identify Match Positions – Locate where patterns occur in a string.
- Retrieve Matched Text – Extract meaningful data from unstructured text.
- Differentiate `match()` and `search()` – Understand their behavior in pattern matching.
Using `search()` to Get a Match Object
- The `search()` function returns a Match object if the pattern is found.
import re
# Sample text
text = "Natural Language Processing (NLP) and Deep Learning are part of AI."
# Regular expression pattern to find "AI"
pattern = r"AI"
# Searching for the first occurrence of "AI"
match = re.search(pattern, text)
# Printing the Match Object
print("Match Object:", match) # Output: <re.Match object; span=(64, 66), match='AI'>
Match Object: <re.Match object; span=(64, 66), match='AI'>
Match Object Methods
1. `.span()` – Get Start and End Position of the Match
- This method returns a tuple `(start, end)` with the position of the match. The end index is exclusive, as in Python slicing.
import re
# Sample text
text = "Machine Learning and AI are transforming industries."
# Searching for "AI"
match = re.search(r"AI", text)
if match:
    print("Match Position (Start, End):", match.span())  # Output: (21, 23)
### Output: Match Position (Start, End): (21, 23)
Match Position (Start, End): (21, 23)
2. `.string` – Get the Original Search String
- This property returns the entire string that was passed into the function.
import re
# Sample text
text = "Data Science and AI are shaping the future."
# Searching for "AI"
match = re.search(r"AI", text)
if match:
    print("Original String:", match.string)
### Output: Original String: Data Science and AI are shaping the future.
Original String: Data Science and AI are shaping the future.
3. `.group()` – Get the Matched Text
- This method returns the actual text that matched the pattern.
import re
# Sample text
text = "Generative AI is revolutionizing content creation."
# Searching for "AI"
match = re.search(r"AI", text)
if match:
    print("Matched Text:", match.group())  # Output: AI
### Output: Matched Text: AI
Matched Text: AI
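The three members above are related: slicing `.string` with the values from `.span()` gives the same text as `.group()`. The sketch below also uses `start()` and `end()`, which are not covered above but return the two halves of the span individually.

```python
import re

text = "Machine Learning and AI are transforming industries."
match = re.search(r"AI", text)

if match:
    start, end = match.span()
    # start() and end() return the same values as span(), one at a time
    assert (match.start(), match.end()) == (start, end)
    # Slicing the original string with the span reproduces the matched text
    print(match.string[start:end])  # AI
```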
Difference Between `match()` and `search()`
Function | Behavior |
---|---|
match() | Checks for a match only at the beginning of the string. |
search() | Checks for a match anywhere in the string. |
- In the example below, both `match()` and `search()` return the same result because "AI" is at the beginning of the string.
- If "AI" were not at the start, `match()` would return `None`, while `search()` would still find it.
import re
text = "AI is changing the world of technology."
# Using match() – Only matches if "AI" is at the beginning
match_start = re.match(r"AI", text)
print("match() result:", match_start) # Output: <re.Match object; span=(0, 2), match='AI'>
# Using search() – Matches "AI" anywhere in the string
match_anywhere = re.search(r"AI", text)
print("search() result:", match_anywhere) # Output: <re.Match object; span=(0, 2), match='AI'>
match() result: <re.Match object; span=(0, 2), match='AI'>
search() result: <re.Match object; span=(0, 2), match='AI'>
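To see the two functions diverge, here is a sketch of the opposite case, where "AI" is not at the start of the string. The sample sentence is made up for illustration.

```python
import re

# Hypothetical text in which "AI" appears only near the end
text = "The world of technology is changing because of AI."

at_start = re.match(r"AI", text)
anywhere = re.search(r"AI", text)

print("match() result:", at_start)           # None
print("search() result:", anywhere.group())  # AI
```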
RegEx Metacharacters
- Metacharacters are special symbols in Regular Expressions (RegEx) that help define search patterns for string manipulation. These characters have special meanings and allow flexible pattern matching in Python.
Common Metacharacters in RegEx
| Metacharacter | Description | Example |
|--------------|----------------|----------------|
| `[]` | Matches any one character from a set | `"[a-m]"` (Matches lowercase letters from 'a' to 'm') |
| `\` | Indicates a special sequence or escapes a special character | `"\d"` (Matches any digit) |
| `.` | Matches any character except newline | `"In..nsity"` (Matches "Intensity", "In99nsity") |
| `^` | Matches if string starts with a pattern | `"^Intensity"` (Matches "Intensity Coding", but not "AI Intensity") |
| `$` | Matches if string ends with a pattern | `"Coding$"` (Matches "Intensity Coding", but not "Coding Intensity") |
| `*` | Matches 0 or more occurrences of the preceding element | `"Int.*ity"` (Matches "Intensity", "Intelligent AI Community") |
| `+` | Matches 1 or more occurrences of the preceding element | `"Int.+ity"` (Matches "Intensity", but not "Intity") |
| `?` | Matches 0 or 1 occurrence of the preceding element | `"Int.?ity"` (Matches "Intity" or "IntXity", but not "Intensity") |
| `{}` | Matches an exact number of occurrences of the preceding element | `"Int.{2}ity"` (Matches "IntXXity", "Int99ity") |
| `\|` | Matches either/or condition | `"AI\|ML"` (Matches "AI" or "ML") |
| `()` | Groups part of the pattern | `"Intensity (Coding\|Learning)"` (Matches "Intensity Coding" or "Intensity Learning") |
Metacharacters in Action (Code Examples)
1. `[]` – Match Specific Character Sets
- Find all lowercase letters from "a" to "m":
import re
text = "Intensity Coding and AI Learning"
# Find all letters from 'a' to 'm'
matches = re.findall("[a-m]", text)
print("Matched characters:", matches)
### Output: Matched characters: ['e', 'i', 'd', 'i', 'g', 'a', 'd', 'e', 'a', 'i', 'g']
Matched characters: ['e', 'i', 'd', 'i', 'g', 'a', 'd', 'e', 'a', 'i', 'g']
2. `\d` – Find Digits in Text
- Extract all numeric digits from a string.
import re
text = "The AI model achieved 99% accuracy with 5000 training samples."
# Find all digits in the text (the raw string keeps \d from being read as a string escape)
matches = re.findall(r"\d", text)
print("Matched digits:", matches)
# Output:
# Matched digits: ['9', '9', '5', '0', '0', '0']
Matched digits: ['9', '9', '5', '0', '0', '0']
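Because `\d` matches one digit at a time, the result above splits 99 and 5000 into single characters. Combining it with `+` keeps each number together. A short sketch on the same text; the conversion to `int` is an optional extra step:

```python
import re

text = "The AI model achieved 99% accuracy with 5000 training samples."

digits = re.findall(r"\d", text)    # one digit per match
numbers = re.findall(r"\d+", text)  # one run of digits per match
values = [int(n) for n in numbers]  # optional: convert the strings to integers

print(digits)   # ['9', '9', '5', '0', '0', '0']
print(numbers)  # ['99', '5000']
print(values)   # [99, 5000]
```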
3. `.` – Match Any Character Except Newline
- Find substrings that start with "In", followed by any two characters, and then "ity".
import re
text = "Intensity Insight AI Int99ity In56ity"
# Match pattern: "In" followed by any two characters, then "ity"
matches = re.findall("In..ity", text)
print("Matched words:", matches)
### Output: Matched words: ['In56ity']
Matched words: ['In56ity']
4. `^` and `$` – Match Start and End of String
- Check whether a string starts with "Intensity" and whether it ends with "Coding".
import re
text = "Intensity Coding is a platform for AI and ML tutorials"
# Check if text starts with "Intensity"
if re.findall("^Intensity", text):
    print("Text starts with 'Intensity'.")
# Check if text ends with "Coding"
if re.findall("Coding$", text):
    print("Text ends with 'Coding'.")
else:
    print("No match found.")
## Output
# Text starts with 'Intensity'.
# No match found.
Text starts with 'Intensity'.
No match found.
5. `*`, `+`, `?` – Match Repeating Characters
- Find different occurrences of "In" followed by characters before "ity".
import re
text = "Intensity Insight AI Int99ity Inity"
# Match 'In' followed by 0 or more characters before 'ity'
# Non-greedy match: makes sure it stops at the first possible "ity"
matches_star = re.findall("In.*?ity", text)
# Match 'In' followed by 1 or more characters before 'ity'
# Non-greedy match: makes sure it stops at the first possible "ity"
matches_plus = re.findall("In.+?ity", text)
# Match 'In' followed by 0 or 1 character before 'ity'
matches_question = re.findall("In.?ity", text)
print("Matches (*):", matches_star)
print("Matches (+):", matches_plus)
print("Matches (?):", matches_question)
### Output:
# Matches (*): ['Intensity', 'Insight AI Int99ity', 'Inity']
# Matches (+): ['Intensity', 'Insight AI Int99ity']
# Matches (?): ['Inity']
Matches (*): ['Intensity', 'Insight AI Int99ity', 'Inity']
Matches (+): ['Intensity', 'Insight AI Int99ity']
Matches (?): ['Inity']
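The `?` placed after `*` and `+` in the example above makes them non-greedy. For contrast, here is a sketch of what the greedy form does on the same text: `.*` grabs as much as it can, so the match runs to the last "ity" in the string.

```python
import re

text = "Intensity Insight AI Int99ity Inity"

# Greedy: .* extends to the LAST possible "ity"
greedy = re.findall(r"In.*ity", text)

# Non-greedy: .*? stops at the FIRST possible "ity"
lazy = re.findall(r"In.*?ity", text)

print(greedy)  # ['Intensity Insight AI Int99ity Inity']
print(lazy)    # ['Intensity', 'Insight AI Int99ity', 'Inity']
```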
6. `{}` – Match Exact Number of Characters
- Find "In" followed by exactly 2 characters before "ity":
import re
text = "Intensity In55ity Int99ity Int888ity"
# Match 'In' followed by exactly 2 characters before 'ity'
matches = re.findall("In.{2}ity", text)
print("Matched words:", matches)
### Output
# Matched words: ['In55ity']
Matched words: ['In55ity']
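Braces also accept a range. As a sketch beyond the exact-count example above, `{2,3}` allows two or three repetitions, which on the same text also picks up "Int99ity":

```python
import re

text = "Intensity In55ity Int99ity Int888ity"

# Exactly 2 characters between "In" and "ity"
exact = re.findall(r"In.{2}ity", text)

# Between 2 and 3 characters between "In" and "ity"
ranged = re.findall(r"In.{2,3}ity", text)

print(exact)   # ['In55ity']
print(ranged)  # ['In55ity', 'Int99ity']
```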
7. `|` – Match Either Condition
- Find if either "AI" or "ML" is present
import re
text = "Intensity Coding focuses on AI and ML concepts."
# Search for 'AI' or 'ML'
matches = re.findall("AI|ML", text)
if matches:
    print("Match found:", matches)
else:
    print("No match found.")
### Output
# Match found: ['AI', 'ML']
Match found: ['AI', 'ML']
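Alternation applies to everything on either side of `|` unless parentheses limit its scope. The sketch below illustrates the difference with a made-up sentence; it uses a non-capturing group `(?:...)` so that `findall()` still returns the full match.

```python
import re

# Hypothetical sample text for illustration
text = "Deep Learning and Intensity Learning"

# Without grouping: "Intensity Coding" OR "Learning"
loose = re.findall(r"Intensity Coding|Learning", text)

# With grouping: "Intensity " followed by "Coding" OR "Learning"
grouped = re.findall(r"Intensity (?:Coding|Learning)", text)

print(loose)    # ['Learning', 'Learning']
print(grouped)  # ['Intensity Learning']
```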
RegEx Sets
- Sets in regular expressions let you define a collection of characters inside square brackets (`[]`).
- A set matches exactly one character in the string: any character listed in it, any character within one of its ranges, or, with a leading `^`, any character not listed.
Common Set Patterns in Regular Expressions
Set Pattern | Description |
---|---|
[agn] | Matches any occurrence of 'a', 'g', or 'n'. |
[a-n] | Matches any lowercase letter between 'a' and 'n'. |
[^agn] | Matches any character except 'a', 'g', or 'n'. |
[0123] | Matches any occurrence of '0', '1', '2', or '3'. |
[0-9] | Matches any digit from '0' to '9'. |
[0-5][0-9] | Matches any two-digit sequence between '00' and '59'. |
[a-zA-Z] | Matches any lowercase or uppercase letter from 'a' to 'z'. |
[+] | Matches the literal + character. Most special characters (such as +, *, . and ?) lose their special meaning inside a set. |
Examples Using Regular Expression Sets
1. Matching Specific Characters (`[agn]`)
import re
text = "Intensity Coding"
# Find occurrences of 'a', 'g', or 'n'
matches = re.findall("[agn]", text)
print(matches) # Output: ['n', 'n', 'n', 'g']
['n', 'n', 'n', 'g']
2. Matching Characters in a Range (`[a-n]`)
text = "Intensity Coding"
# Find lowercase letters between 'a' and 'n'
matches = re.findall("[a-n]", text)
print(matches) # Output: ['n', 'e', 'n', 'i', 'd', 'i', 'n', 'g']
['n', 'e', 'n', 'i', 'd', 'i', 'n', 'g']
3. Matching Characters Except Certain Ones (`[^agn]`)
text = "Intensity Coding"
# Find characters that are NOT 'a', 'g', or 'n'
matches = re.findall("[^agn]", text)
print(matches) # Output: ['I', 't', 'e', 's', 'i', 't', 'y', ' ', 'C', 'o', 'd', 'i']
['I', 't', 'e', 's', 'i', 't', 'y', ' ', 'C', 'o', 'd', 'i']
4. Finding Specific Digits (`[0123]`)
text = "Intensity Coding was founded in 2023."
# Find occurrences of '0', '1', '2', or '3'
matches = re.findall("[0123]", text)
print(matches) # Output: ['2', '0', '2', '3']
['2', '0', '2', '3']
5. Finding Any Numeric Digit (`[0-9]`)
text = "AI models have been trained on 500,000+ datasets."
# Find all digits (0-9)
matches = re.findall("[0-9]", text)
print(matches) # Output: ['5', '0', '0', '0', '0', '0']
['5', '0', '0', '0', '0', '0']
6. Matching Two-Digit Numbers (Range `00-59`)
text = "Training started at 09:45 AM."
# Find two-digit numbers between 00 and 59
matches = re.findall("[0-5][0-9]", text)
print(matches) # Output: ['09', '45']
['09', '45']
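One caveat with this pattern: `[0-5][0-9]` also matches pairs of digits inside longer numbers. The sketch below, on a made-up sentence that adds a year, shows the effect and one possible fix using `\b` word boundaries.

```python
import re

# Hypothetical sample text containing both a time and a four-digit year
text = "Training started at 09:45 in 2023."

# Matches any qualifying digit pair, including pieces of "2023"
loose = re.findall(r"[0-5][0-9]", text)

# \b on both sides requires the pair to stand alone
strict = re.findall(r"\b[0-5][0-9]\b", text)

print(loose)   # ['09', '45', '20', '23']
print(strict)  # ['09', '45']
```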
7. Finding Uppercase and Lowercase Letters (`[a-zA-Z]`)
text = "NLP and Computer Vision are AI fields."
# Find all alphabetic characters
matches = re.findall("[a-zA-Z]", text)
print(matches)
# Output: ['N', 'L', 'P', 'a', 'n', 'd', 'C', 'o', 'm', 'p', 'u', 't', 'e', 'r', 'V', 'i', 's', 'i', 'o', 'n', 'a', 'r', 'e', 'A', 'I', 'f', 'i', 'e', 'l', 'd', 's']
['N', 'L', 'P', 'a', 'n', 'd', 'C', 'o', 'm', 'p', 'u', 't', 'e', 'r', 'V', 'i', 's', 'i', 'o', 'n', 'a', 'r', 'e', 'A', 'I', 'f', 'i', 'e', 'l', 'd', 's']
8. Finding Special Characters (`[+]`)
import re
text = "The accuracy increased by +7% after tuning."
# Find occurrences of the '+' character
# In a set [], special characters like + lose their special meaning
# So, "[+]" will match the literal '+' character in the string
matches = re.findall("[+]", text)
print(matches)
# Output: ['+']
['+']
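The same holds for several other metacharacters, but not all of them: inside a set, `-` between two characters still defines a range, `^` as the first character still negates, and `\` still escapes. A small sketch with made-up strings:

```python
import re

# . + * ? are all literal inside a set
literal = re.findall(r"[.+*?]", "1+1=2. Right? 2*2=4.")
print(literal)  # ['+', '.', '?', '*', '.']

# An escaped hyphen is a literal '-', not a range from 'a' to 'z'
hyphen = re.findall(r"[a\-z]", "a-z m")
print(hyphen)   # ['a', '-', 'z']

# '^' is literal when it is not the first character of the set
caret = re.findall(r"[z^]", "x^z")
print(caret)    # ['^', 'z']
```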
RegEx Special Sequences
- Regular Expressions (RegEx) in Python include special sequences that provide powerful ways to search for patterns in text. Special sequences are prefixed with a backslash (`\`), giving them a unique meaning beyond standard character matching.
Common Special Sequences in RegEx
Character | Description | Example |
---|---|---|
\A | Matches only at the start of the string. | r"\AThe" |
\b | Matches at a word boundary (the start or end of a word). | r"\bAI" or r"AI\b" |
\B | Matches at a position that is not a word boundary. | r"\BAI" |
\d | Matches any digit (0-9). | r"\d" |
\D | Matches any non-digit. | r"\D" |
\s | Matches any whitespace (space, tab, newline). | r"\s" |
\S | Matches any non-whitespace character. | r"\S" |
\w | Matches word characters (a-z, A-Z, 0-9, _). | r"\w" |
\W | Matches non-word characters (!, ?, whitespace, etc.). | r"\W" |
\Z | Matches only at the end of the string. | r"coding\Z" |
Examples Using Special Sequences
1. Checking if a String Starts with a Specific Word (`\A`)
import re
text = "Intensity Coding provides AI tutorials."
# Check if the string starts with 'Intensity'
matches = re.findall(r"\AIntensity", text)
print(matches)
# Output: ['Intensity']
if matches:
    print("Yes, the string starts with 'Intensity'!")
else:
    print("No match found.")
# Output: Yes, the string starts with 'Intensity'!
['Intensity']
Yes, the string starts with 'Intensity'!
2. Finding a Word at the Start or End (`\b`)
text = "AI is revolutionizing technology."
# Check if 'AI' is at the start of a word
matches = re.findall(r"\bAI", text)
print(matches) # Output: ['AI']
if matches:
    print("Yes, 'AI' appears at the beginning of a word!")
else:
    print("No match found.")
# Output: Yes, 'AI' appears at the beginning of a word!
['AI']
Yes, 'AI' appears at the beginning of a word!
import re
text = "The latest advancements in AI are impressive."
# Check if 'AI' is at the end of a word
matches = re.findall(r"AI\b", text)
print(matches)
# Output: ['AI']
if matches:
    print("Yes, 'AI' appears at the end of a word!")
else:
    print("No match found.")
# Output:
# Yes, 'AI' appears at the end of a word!
['AI']
Yes, 'AI' appears at the end of a word!
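In both examples "AI" is already a standalone word, so the effect of `\b` is not visible. The sketch below uses a made-up sentence in which "AI" also occurs inside other words, and counts the matches for each variant.

```python
import re

# Hypothetical text: "AI" occurs inside "MAIN", at the start of "AIM", and alone
text = "MAIN AIM: train AI models"

anywhere = re.findall(r"AI", text)      # MAIN, AIM, AI
starts = re.findall(r"\bAI", text)      # AIM, AI
ends = re.findall(r"AI\b", text)        # AI
whole = re.findall(r"\bAI\b", text)     # AI

print(len(anywhere), len(starts), len(ends), len(whole))  # 3 2 1 1
```

Using `\b` on both sides is a common way to match a whole word only.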
3. Finding a Pattern Inside a Word, Not at Its Start (`\B`)
text = "Brainpower is essential in AI research."
# Check if 'ain' appears inside a word (\B requires that 'ain' does not begin at a word boundary)
matches = re.findall(r"\Bain", text)
print(matches)
# Output: ['ain']
if matches:
    print("Yes, 'ain' appears inside a word!")
else:
    print("No match found.")
# Output: Yes, 'ain' appears inside a word!
['ain']
Yes, 'ain' appears inside a word!
4. Finding Digits (`\d`)
text = "The accuracy improved by 98%."
# Find all numeric digits
matches = re.findall(r"\d", text)
print(matches)  # Output: ['9', '8']
if matches:
    print("Yes, the string contains digits!")
else:
    print("No match found.")
# Output: Yes, the string contains digits!
['9', '8']
Yes, the string contains digits!
5. Finding Non-Digit Characters (`\D`)
text = "ML models reached 99% accuracy."
# Find all non-digit characters
matches = re.findall(r"\D", text)
print(matches)
# Output: ['M', 'L', ' ', 'm', 'o', 'd', 'e', 'l', 's', ' ', 'r', 'e', 'a', 'c', 'h', 'e', 'd', ' ', '%', ' ', 'a', 'c', 'c', 'u', 'r', 'a', 'c', 'y', '.']
if matches:
    print("Yes, the string contains non-digit characters!")
else:
    print("No match found.")
# Output:
# Yes, the string contains non-digit characters!
['M', 'L', ' ', 'm', 'o', 'd', 'e', 'l', 's', ' ', 'r', 'e', 'a', 'c', 'h', 'e', 'd', ' ', '%', ' ', 'a', 'c', 'c', 'u', 'r', 'a', 'c', 'y', '.']
Yes, the string contains non-digit characters!
6. Finding Whitespaces (`\s`)
text = "Deep Learning improves AI performance."
# Find all whitespace characters
matches = re.findall(r"\s", text)
print(matches)
# Output: [' ', ' ', ' ', ' ']
if matches:
    print("Yes, the string contains whitespace characters!")
else:
    print("No match found.")
# Output:
# Yes, the string contains whitespace characters!
[' ', ' ', ' ', ' ']
Yes, the string contains whitespace characters!
7. Finding Non-Whitespace Characters (`\S`)
text = "AI models are evolving!"
# Find all non-whitespace characters
matches = re.findall(r"\S", text)
print(matches)
# Output: ['A', 'I', 'm', 'o', 'd', 'e', 'l', 's', 'a', 'r', 'e', 'e', 'v', 'o', 'l', 'v', 'i', 'n', 'g', '!']
if matches:
    print("Yes, the string contains non-whitespace characters!")
else:
    print("No match found.")
# Output:
# Yes, the string contains non-whitespace characters!
['A', 'I', 'm', 'o', 'd', 'e', 'l', 's', 'a', 'r', 'e', 'e', 'v', 'o', 'l', 'v', 'i', 'n', 'g', '!']
Yes, the string contains non-whitespace characters!
8. Finding Word Characters (`\w`)
text = "AI_2025 is the future of automation."
# Find all word characters
matches = re.findall(r"\w", text)
print(matches)
# Output: ['A', 'I', '_', '2', '0', '2', '5', 'i', 's', 't', 'h', 'e', 'f', 'u', 't', 'u', 'r', 'e', 'o', 'f', 'a', 'u', 't', 'o', 'm', 'a', 't', 'i', 'o', 'n']
if matches:
    print("Yes, the string contains word characters!")
else:
    print("No match found.")
# Output:
# Yes, the string contains word characters!
['A', 'I', '_', '2', '0', '2', '5', 'i', 's', 't', 'h', 'e', 'f', 'u', 't', 'u', 'r', 'e', 'o', 'f', 'a', 'u', 't', 'o', 'm', 'a', 't', 'i', 'o', 'n']
Yes, the string contains word characters!
9. Finding Non-Word Characters (`\W`)
import re
text = "AI-powered tools: fast, reliable & scalable!"
# Find all non-word characters (!, ?, whitespace, etc.)
matches = re.findall(r"\W", text)
print(matches)
# Output: ['-', ' ', ':', ' ', ',', ' ', ' ', '&', ' ', '!']
if matches:
    print("Yes, the string contains non-word characters!")
else:
    print("No match found.")
# Output:
# Yes, the string contains non-word characters!
['-', ' ', ':', ' ', ',', ' ', ' ', '&', ' ', '!']
Yes, the string contains non-word characters!
10. Checking if a String Ends with a Specific Word (`\Z`)
import re
text = "Machine Learning is the future of AI."
# Check if the string ends with 'AI' (it does not: the final character is '.')
matches = re.findall(r"AI\Z", text)
print(matches)
# Output: []
if matches:
    print("Yes, the string ends with 'AI'!")
else:
    print("No match found.")
# Output:
# []
# No match found.
[]
No match found.
# Let's modify the text to make it match
text2 = "The future of AI"
matches2 = re.findall(r"AI\Z", text2)
print(matches2)
# Output: ['AI']
if matches2:
    print("Yes, the string ends with 'AI'!")
else:
    print("No match found.")
# Output:
# Yes, the string ends with 'AI'!
['AI']
Yes, the string ends with 'AI'!
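`\Z` and `$` are both described above as matching the end of the string, but in Python they differ when the string ends with a newline: `$` also matches just before a trailing newline, while `\Z` matches only at the very end. A minimal sketch with a made-up input:

```python
import re

# Hypothetical input with a trailing newline, as when reading a line from a file
line = "The future of AI\n"

with_dollar = re.findall(r"AI$", line)   # '$' tolerates the trailing newline
with_big_z = re.findall(r"AI\Z", line)   # '\Z' does not

print(with_dollar)  # ['AI']
print(with_big_z)   # []
```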