Regular Expressions In Python

What is RegEx in Python: 

A RegEx, or regular expression, is a sequence of characters that forms a search pattern. The pattern is then used to check whether a string contains text that matches it. Regular expressions are widely used in UNIX.

The Python module re is used to work with regular expressions. When writing a pattern, we can use a raw string such as r'expression'. With the help of regular expressions, a programmer can match and extract any string pattern from a given text.

For example, suppose we want to match names such as "Mr. ABC" in a given list and then extract only the name, i.e. ABC, without the "Mr." prefix. This is a problem we can solve with regular expressions in Python.
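The "Mr. ABC" case above can be sketched roughly as follows. The list of names is made up for illustration, and the pattern assumes the prefix is written exactly as "Mr." followed by whitespace and a single-word name.

```python
import re

# Hypothetical sample data, used only for illustration
names = ["Mr. ABC", "Mr. Sharma", "Ms. Patil"]

# "Mr\." matches the literal prefix; (\w+) captures the name that follows
pattern = re.compile(r"Mr\.\s+(\w+)")

extracted = []
for name in names:
    match = pattern.fullmatch(name)
    if match:
        # group(1) is the captured name without the prefix
        extracted.append(match.group(1))

print(extracted)  # ['ABC', 'Sharma']
```

"Ms. Patil" is skipped because it does not start with "Mr.".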

Regular expressions are widely used to process texts, emails, and documents. A regular expression is sometimes described as a small string-matching language.

Where to use Regular Expression:

Form Validations:

Regular expressions are mostly used in form validations such as email validation, phone number validation, and password validation.
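A minimal sketch of such form checks is shown here. Both patterns are deliberately simplified assumptions (a basic name@domain.tld shape and a 10-digit phone number); real forms usually need stricter or different rules, and the function names are hypothetical.

```python
import re

# Simplified patterns, assumed for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\d{10}")

def is_valid_email(value):
    # fullmatch() requires the whole string to fit the pattern
    return EMAIL_RE.fullmatch(value) is not None

def is_valid_phone(value):
    return PHONE_RE.fullmatch(value) is not None

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@@example"))     # False
print(is_valid_phone("9876543210"))        # True
print(is_valid_phone("98765"))             # False
```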

Account Details:

Credit card and debit card numbers typically have 16 digits, and the first few digits indicate whether the card is a Visa, Mastercard, or RuPay card. Regular expressions are used to detect or search for the specified pattern in a given number.

The IFSC code of a bank branch starts with letters that identify the bank, followed by further characters. Regular expressions are used to find this sequence.
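The two checks above could look something like this. The sketch assumes that a Visa number is 16 digits starting with 4, and that a bank code (IFSC) is four capital letters, a zero, and six letters or digits. The function names and sample values are hypothetical, and the sketch does not verify that a number or code is actually valid.

```python
import re

# Assumption: a Visa card number is 16 digits and starts with 4
VISA_RE = re.compile(r"4\d{15}")
# Assumption: an IFSC code is 4 letters, a zero, then 6 letters or digits
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")

def looks_like_visa(number):
    # Drop spaces and hyphens before checking the digits
    digits = re.sub(r"[\s-]", "", number)
    return VISA_RE.fullmatch(digits) is not None

def looks_like_ifsc(code):
    return IFSC_RE.fullmatch(code) is not None

print(looks_like_visa("4111 1111 1111 1111"))  # True
print(looks_like_visa("5111 1111 1111 1111"))  # False
print(looks_like_ifsc("ABCD0123456"))          # True
print(looks_like_ifsc("ABC0123456"))           # False
```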

Regular Expression in Data Mining/ NLP:

In data mining, unstructured data is first converted into a structured form, and then a model is built and trained to get the final results. Regular expressions play a very important role in transforming the data from unstructured to structured form.

Data is cleansed by removing stop words, special symbols, punctuation, etc., and regular expressions are used for this removal.
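A rough sketch of this kind of cleansing follows. The sample sentence and the tiny stop-word list are assumptions made up for the example; real NLP pipelines normally use a much larger list.

```python
import re

text = "Hello!!! This is, an example... of text-cleaning #NLP."

# Replace every character that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", " ", text)

# Hypothetical, very small stop-word list
stop_words = r"\b(?:is|an|of|the|this)\b"
no_stop = re.sub(stop_words, " ", no_punct, flags=re.IGNORECASE)

# Collapse the runs of spaces left behind by the removals
cleaned = re.sub(r"\s+", " ", no_stop).strip()

print(cleaned)  # Hello example text cleaning NLP
```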

RegEx Module:

Python has a built-in package called re for working with regular expressions. To use regular expressions, we first have to import the re module as follows:

import re

Example

import re

name = "My name is Manisha"

# Search for a string that starts with "My" and ends with "Manisha"
stringsearch = re.search(r"^My.*Manisha$", name)

if stringsearch:
    print("Yes match found")
else:
    print("No match found")

O/P:

Yes match found

In the above example, the variable name stores the string "My name is Manisha". Another variable, stringsearch, stores the result of the search. The ^ symbol anchors the match at the start of the string. The . matches any character except the newline character. The * matches zero or more occurrences of the element before it. The $ anchors the match at the end of the string.

Another Example,

import re

India = 'India is my country, All Indians are my brothers and sisters'

# Find where the word "brothers" starts and ends in the string
match = re.search(r'brothers', India)
print("The start Index", match.start())
print("The end index", match.end())

O/P:

The start Index 40

The end index 48

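In the above example, the pattern brothers is written as a raw string, which is denoted by r'rawstring'. The search finds this pattern in the given string, and match.start() and match.end() return its starting and ending index. In a raw string, Python does not treat the backslash as an escape character, so the pattern reaches the re module unchanged.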
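The difference a raw string makes is easiest to see with a pattern that contains a backslash. The short sketch below uses a made-up sentence; without the r prefix, Python reads "\b" as a backspace character before re ever sees it.

```python
import re

text = "a cat sat"

# Without r, "\b" becomes a backspace character (\x08), so nothing matches
plain = re.search("\bcat", text)

# With r, the backslash is kept and \b means "word boundary" to re
raw = re.search(r"\bcat", text)

print(plain)                   # None
print(raw.start(), raw.end())  # 2 5
```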

 Meta Characters:

The following metacharacters are used in regular expressions.

1.       \ (Backslash):

The backslash makes sure that the character after it is not treated in a special way. If you want to search for a character that has a special meaning in a pattern, such as . or *, place a backslash before that character so that it is matched literally.
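As a small illustration with made-up sample strings, compare an unescaped dot with an escaped one:

```python
import re

sample = "3.50 and 3x50"

# Unescaped dot: matches any character between 3 and 5
loose = re.findall(r"3.5", sample)

# Escaped dot: matches only a literal "."
strict = re.findall(r"3\.5", sample)

# An opening parenthesis is also special, so it needs a backslash too
parens = re.findall(r"\(", "f(x) + g(y)")

print(loose)   # ['3.5', '3x5']
print(strict)  # ['3.5']
print(parens)  # ['(', '(']
```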

2.       [] Square Bracket:

Square brackets represent a set of characters, any one of which may match. We can also write a range of characters between the square brackets.

For example,

[0-9], [a-zA-Z]
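A brief sketch of character sets and ranges, using made-up sample text:

```python
import re

text = "Room 42B, Floor 7"

# [0-9] matches any single digit
digits = re.findall(r"[0-9]", text)

# [a-zA-Z] matches any single lowercase or uppercase letter
letters = re.findall(r"[a-zA-Z]", "42B")

# A set can also list individual characters
vowels = re.findall(r"[aeiou]", "regular")

print(digits)   # ['4', '2', '7']
print(letters)  # ['B']
print(vowels)   # ['e', 'u', 'a']
```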

3.       ^ - Caret:

The caret symbol ^ is used to check whether the string starts with a given character or not.

For example,

^M checks whether the given string starts with M, as in More, Mane, Multiple, etc. The match is case-sensitive, so a string starting with a lowercase m does not match.

4.       $ - Dollar:

The dollar symbol is used to check whether the string ends with a given character or not.

For example,

l$ checks whether the given string ends with the character l, as in beautiful, colourful, etc.
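The caret and dollar anchors can be tried out on a few words. The word list here is made up for the example.

```python
import re

words = ["Manisha", "manisha", "beautiful", "Beautiful day"]

# ^M: the string must start with a capital M
starts_with_M = [w for w in words if re.search(r"^M", w)]

# l$: the string must end with a lowercase l
ends_with_l = [w for w in words if re.search(r"l$", w)]

print(starts_with_M)  # ['Manisha']
print(ends_with_l)    # ['beautiful']
```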

5.       . - Dot:

The . matches any single character except the newline character.

For example,

m.n matches any string that contains m and n with exactly one character in place of the dot, such as man or men.

6.       | - Or:

The Or symbol works like a logical Or. It checks whether the pattern before or the pattern after the Or symbol is present in the given string.

For example,

m|n will search for and match any string that contains m or n, such as many, any, etc.
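A short sketch of the dot and the Or symbol, with sample strings chosen only for illustration:

```python
import re

# m.n needs exactly one character between m and n
dot_matches = re.findall(r"m.n", "man men mn moon")

# m|n matches either an m or an n
either = re.findall(r"m|n", "many")

# A string with neither m nor n gives no match at all
none_found = re.search(r"m|n", "etc")

print(dot_matches)  # ['man', 'men']
print(either)       # ['m', 'n']
print(none_found)   # None
```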

7.       ? - Question Mark:

The question mark checks whether the element just before it in the regular expression occurs once or not at all.

For example,

xy?z will match xz, xyz, and mxyz, but it will not match xyyz because there are two y characters.

8.       * - Star:

The * symbol matches zero or more occurrences of the element preceding it.

For example,

xy*z matches an x followed by zero or more y characters and then a z, as in xz, xyzxyz, abxyz, klmnopxyz, etc.

9.       + - Plus:

The plus symbol matches one or more occurrences of the element preceding it.

For example,

xy+z matches xyz and xyyz, but it does not match xz because at least one y is required.
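The three repetition symbols ?, * and + can be compared side by side. This sketch uses re.fullmatch() so that each candidate must fit the pattern completely; the candidate strings are made up for the comparison.

```python
import re

candidates = ["xz", "xyz", "xyyz"]

results = {}
for pattern in (r"xy?z", r"xy*z", r"xy+z"):
    # Keep only the candidates that match the whole pattern
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# xy?z ['xz', 'xyz']
# xy*z ['xz', 'xyz', 'xyyz']
# xy+z ['xyz', 'xyyz']
```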

 Special Sequences Used in Regular Expression:

A special sequence in a regular expression is a \ followed by one of the characters given below.

1.     \A:

The \A sequence matches the specified characters only at the beginning of the string. If a match is found, findall() returns it in a list; otherwise it returns an empty list.

For example,

# Use of \A sequence
import re

string1 = "Hello Friends how are you"

# Check if the string starts with "H":
check = re.findall(r"\AH", string1)
print(check)

if check:
    print("Yes, string starts with H!")
else:
    print("No, string does not start with H")

O/P:

['H']

Yes, string starts with H!

 

2.     \b:

The \b sequence matches the specified characters at the beginning or at the end of a word.

For example,

# Use of \b sequence
import re

string1 = "Hello Friends how are you"

# Check if "llo" occurs at the end of a word:
check = re.findall(r"llo\b", string1)
print(check)

if check:
    print("Yes, character is found!")
else:
    print("No, character not found")

O/P:

['llo']

Yes, character is found!

Another Program

# Use of \b sequence
import re

string1 = "Hello Friends how are you"

# Check if "llo" occurs at the beginning of a word:
check = re.findall(r"\bllo", string1)
print(check)

if check:
    print("Yes, character is found!")
else:
    print("No, character not found")

O/P:

[]

No, character not found

The r at the beginning makes the pattern a raw string, so that \b reaches re as a word boundary instead of being read by Python as a backspace character.

3.     \B:

This sequence matches the specified characters where they are present in the string but not at the beginning of a word.

For example,

# Use of \B sequence
import re

string1 = "rain rain come again"

# Check that "ain" is present but not at the beginning of a word:
check = re.findall(r"\Bain", string1)
print(check)

if check:
    print("Yes, character is found!")
else:
    print("No, character not found")

O/P:

['ain', 'ain', 'ain']

Yes, character is found!

 

4.     \d and \D:

\d matches any digit 0-9 in the string, and \D matches any character that is not a digit 0-9.

For example,

# Use of \d sequence
import re

string1 = "rain rain come again on 7 th april"

# Check if the string contains any digit
check = re.findall(r"\d", string1)
print(check)

if check:
    print("Yes, digit is found!")
else:
    print("No, digit not found")

O/P:

['7']

Yes, digit is found!

 

Example for \D

# Use of \D sequence
import re

string1 = "rain rain come again "

# Return every character in the string that is not a digit
check = re.findall(r"\D", string1)
print(check)

if check:
    print("Yes, match found!")
else:
    print("No, match not found")

O/P:

['r', 'a', 'i', 'n', ' ', 'r', 'a', 'i', 'n', ' ', 'c', 'o', 'm', 'e', ' ', 'a', 'g', 'a', 'i', 'n', ' ']

Yes, match found!

5.     \s and \S:

\s returns a match wherever the string contains a whitespace character, and \S returns a match wherever the string contains a character that is not whitespace.

Example:

# Use of \s sequence
import re

string1 = "rain rain come again "

# Returns a match for every whitespace character in the string
check = re.findall(r"\s", string1)
print(check)

if check:
    print("Yes, white space found!")
else:
    print("No, white space not found")

O/P:

[' ', ' ', ' ', ' ']

Yes, white space found!

Example for \S sequence

# Use of \S sequence
import re

string1 = "rainraincomeagain"

# Returns a match for every character that is not whitespace
check = re.findall(r"\S", string1)
print(check)

if check:
    print("Yes, match found!")
else:
    print("No, match not found")

O/P:

['r', 'a', 'i', 'n', 'r', 'a', 'i', 'n', 'c', 'o', 'm', 'e', 'a', 'g', 'a', 'i', 'n']

Yes, match found!

6.     \w and \W:

\w returns a match wherever the string contains a word character, i.e. a-z, A-Z, 0-9, or the underscore _.

\W returns a match wherever the string contains a character that is not a word character.

Example:

# Use of \w sequence
import re

string1 = "rainraincomeagain123"

# Returns a match at every word character (a-z, A-Z, 0-9, _)
check = re.findall(r"\w", string1)
print(check)

if check:
    print("Yes, match found!")
else:
    print("No, match not found")

O/P:

['r', 'a', 'i', 'n', 'r', 'a', 'i', 'n', 'c', 'o', 'm', 'e', 'a', 'g', 'a', 'i', 'n', '1', '2', '3']

Yes, match found!

\W sequence example,

# Use of \W sequence
import re

string1 = "rainraincomeagain123"

# Returns a match at every non-word character such as !, ?, or white space
check = re.findall(r"\W", string1)
print(check)

if check:
    print("Yes, match found!")
else:
    print("No, match not found")

O/P:

[]

No, match not found

7.     \Z:

Returns a match if the specified characters are at the end of the string.

Example,

# Use of \Z sequence
import re

string1 = "rain rain come again"

# Check if the string ends with "again"
check = re.findall(r"again\Z", string1)
print(check)

if check:
    print("Yes, match found!")
else:
    print("No, match not found")

O/P:

['again']

Yes, match found!

Functions used in Regular Expressions:

1.     findall() Function:

The findall() function returns a list containing all matches.

# Use of findall() function
import re

# Find all 'i' characters in the given string (the capital 'I' is not matched)
string1 = "India is my country"
check = re.findall(r"i", string1)
print(check)

O/P:

['i', 'i']

2.     search() Function:

The search() function looks for the pattern anywhere in the string and returns a match object for the first place where it is found. If no match is found, None is returned.

For example,

# search() function
import re

string1 = "India is my country"

# "himalay" does not occur in the string, so the result is None
check = re.search(r"himalay", string1)
print(check)

O/P:

None
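For comparison, here is a sketch of a search that does succeed on the same sentence. The word searched for, country, is chosen only for illustration.

```python
import re

string1 = "India is my country"

match = re.search(r"country", string1)
missing = re.search(r"himalay", string1)

if match:
    # group() is the matched text; start() and end() give its position
    print(match.group(), match.start(), match.end())  # country 12 19

print(missing)  # None
```

Because search() returns None when nothing matches, it is safest to test the result with if before calling methods on it.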

3.     split() Function:

The split() function returns a list in which the string has been split at every match.

For example,

# split() function
import re

string1 = "My nation is India"

# Split the string at every whitespace character
splits = re.split(r"\s", string1)
print(splits)

O/P:

['My', 'nation', 'is', 'India']

\s specifies where the string is split, i.e. at each white space.

We can control the number of splits by specifying the maxsplit parameter.

For example,

# split() function with a maximum number of splits
import re

string1 = "My nation is India"

# Split at most 2 times; the rest of the string stays together
splits = re.split(r"\s", string1, maxsplit=2)
print(splits)

O/P:

['My', 'nation', 'is India']

In the above example, 2 is given so that at most 2 splits are made.

4.     sub() Function:

The sub() function replaces one or more matches with another string.

For example,

# sub() function
import re

string1 = "My nation is India"

# Replace every whitespace character with "**"
replace = re.sub(r"\s", "**", string1)
print(replace)

O/P:

My**nation**is**India

We can control the number of replacements by using the count parameter.

For example,

# sub() function with count parameter
import re

string1 = "My nation is India"

# Replace only the first whitespace character
replace = re.sub(r"\s", "**", string1, count=1)
print(replace)

O/P:

My**nation is India
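The split() and sub() examples above work on a string with single spaces. As a hedged sketch, the made-up string below has repeated spaces, which shows why \s+ (one or more whitespace characters) is often a better pattern than \s alone.

```python
import re

messy = "My   nation is  India"

# \s splits at every single space, leaving empty strings behind
single = re.split(r"\s", messy)

# \s+ treats each run of spaces as one separator
runs = re.split(r"\s+", messy)

# The same pattern collapses each run into one space with sub()
tidy = re.sub(r"\s+", " ", messy)

print(single)  # ['My', '', '', 'nation', 'is', '', 'India']
print(runs)    # ['My', 'nation', 'is', 'India']
print(tidy)    # My nation is India
```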

 

