Regex in Python: Complete Tutorial for Data Scientists

Regex is one of the basic building blocks of text mining and analytics. Knowing even basic regex can solve major problems, especially when you need to find a certain pattern in text data. This tutorial will help you understand regex in Python.

Regex In Python-

Python has a built-in package, re, for regular expressions. This tutorial covers four of its functions:

  1. search
  2. split
  3. sub
  4. findall

1.search() –

It returns a match object for the first occurrence of the pattern in the string, or None if there is no match. The code below shows this:

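import re
# Avoid naming the variable "str", which would shadow Python's built-in str type
text = "ai is helping doctors to sort out Pain in Operations"
x = re.search("ai", text)
print("Search will only capture the first occurrence:", x)

Output –

Search will only capture the first occurrence: <re.Match object; span=(0, 2), match='ai'>

(On Python 3.6 and earlier, the object is displayed as _sre.SRE_Match instead of re.Match.)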

Explanations-

Here the pattern "ai" occurs twice in the string: at the very start and inside "Pain". search() captures only the first occurrence.
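The match object itself is rarely what you want to print. As a minimal sketch (the variable names first_span, first_text and missing are only illustrative), the position and text of the match can be read from it, and a pattern that never occurs gives None:

```python
import re

text = "ai is helping doctors to sort out Pain in Operations"

match = re.search("ai", text)
if match is not None:
    first_span = match.span()    # (start, end) indexes of the first match
    first_text = match.group()   # the substring that matched
    print(first_span, first_text)

# A pattern that does not occur returns None instead of a match object
missing = re.search("xyz", text)
print(missing)
```

Checking for None before calling span() or group() avoids an AttributeError when nothing matches.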

2.split()-

This function splits the string wherever the pattern matches and returns a list. See the code below:

import re
temp_str = "The rain in Spain"
x = re.split("ai", temp_str)
print(x)

Output –

['The r', 'n in Sp', 'n']

Explanations-

Here the pattern is "ai". The code breaks the string at every occurrence of the pattern, and the matched text itself is not included in the resulting pieces.
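The pattern does not have to be a fixed substring. The sketch below, using the same sentence, splits on runs of whitespace and then uses the maxsplit argument to stop after the first split (the names words and first_split are only illustrative):

```python
import re

temp_str = "The rain in Spain"

# Split on any run of whitespace rather than a fixed substring
words = re.split(r"\s+", temp_str)
print(words)  # ['The', 'rain', 'in', 'Spain']

# maxsplit=1 stops after the first split, leaving the rest untouched
first_split = re.split("ai", temp_str, maxsplit=1)
print(first_split)  # ['The r', 'n in Spain']
```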

3.sub() –

Use sub() when you need to replace every match of a pattern in a string with another string. See the example below:

import re
text = "I am Interested in AI"
# Raw string so that \s reaches the regex engine as the whitespace class
x = re.sub(r"\s", "%%", text)
print(x)

Output –

I%%am%%Interested%%in%%AI

Explanations-

In the output, you can see that every space has been replaced by "%%".
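Two common variations are sketched below: limiting the number of replacements with the count argument, and collapsing repeated whitespace with \s+. The string messy is a made-up example, not part of the original code.

```python
import re

text = "I am Interested in AI"

# count=1 replaces only the first match
once = re.sub(r"\s", "%%", text, count=1)
print(once)  # I%%am Interested in AI

# \s+ treats a whole run of whitespace as a single match
messy = "I   am    Interested"  # hypothetical input with uneven spacing
collapsed = re.sub(r"\s+", " ", messy)
print(collapsed)  # I am Interested
```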

4.findall()-

This function returns all non-overlapping matches of the pattern in the string as a list. See the code below for an example of findall().

import re
# "text" is used instead of "str" so the built-in str type is not shadowed
text = "I am Data Scientist and AI developer"
x = re.findall("a", text)
print(x)


Output –

['a', 'a', 'a', 'a']

Explanations-

As described above, "a" is the pattern, and it occurs four times in the string. The match is case-sensitive, so the "A" in "AI" is not counted.
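If the uppercase letter should be counted as well, one option is the re.IGNORECASE flag. This is a small sketch on the same sentence; lower_only and any_case are illustrative names:

```python
import re

text = "I am Data Scientist and AI developer"

# Default matching is case-sensitive
lower_only = re.findall("a", text)
print(len(lower_only))  # 4

# re.IGNORECASE also picks up the "A" in "AI"
any_case = re.findall("a", text, flags=re.IGNORECASE)
print(any_case)  # ['a', 'a', 'a', 'a', 'A']
```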

How to generate the pattern –

To build a pattern, you can use some special characters in Python regex:

  1. . (dot) – Matches any single character except a newline.
  2. ^ – Matches at the start of the string.
  3. $ – Matches at the end of the string.
  4. * – Zero or more occurrences of the preceding element (use + for one or more).
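The symbols above can be tried out with the functions covered earlier. The following is a small sketch; the sample string "a ab abb" is made up to show the difference between * and +, and the variable names are illustrative.

```python
import re

text = "The rain in Spain"

# . stands for any single character, so ".ain" matches "rain" and "pain"
dot_matches = re.findall(r".ain", text)
print(dot_matches)  # ['rain', 'pain']

# ^ anchors the pattern to the start of the string, $ to the end
starts_with_the = re.search(r"^The", text) is not None
ends_with_spain = re.search(r"Spain$", text) is not None
print(starts_with_the, ends_with_spain)  # True True

# * allows zero or more of the preceding character, + requires at least one
sample = "a ab abb"  # hypothetical sample string
star_matches = re.findall(r"ab*", sample)
plus_matches = re.findall(r"ab+", sample)
print(star_matches)  # ['a', 'ab', 'abb']
print(plus_matches)  # ['ab', 'abb']
```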

Conclusion-

Regex is a matter of practice, but the basic concepts are still necessary. To become an expert in this topic, you need to solve real text mining problems and practice building patterns with these symbols. I hope you liked this article.

If you would like to add some information about regex in Python, we welcome your suggestions. You may comment or email us. If you would like to contribute a complete guest post, you may do that as well.

Thanks

Data Science Learner Team