Python 정규표현식

23 May 2017

특정한 패턴에 일치하는 복잡한 문자열을 처리하는 기법
파이썬에서는 표준모듈 re를 사용해서 정규표현식을 사용할 수 있다.

import re
result = re.match('Lux', 'Lux, the Lady of Luminosity')

match의 첫번째 인자는 패턴, 두번째 인자는 문자열 소스
match()는 소스와 패턴의 일치여부를 확인하고 일치하면 일치하는 문자열을 반환한다.(match object)
복잡하거나 자주 사용되는 패턴은 변수에 저장 가능
```
pattern1 = re.compile('패턴')
```

match

시작부분부터 일치하는 패턴을 찾는다.
match의 첫번째 인자는 패턴, 두번째 인자는 문자열 소스
match()는 소스와 패턴의 일치여부를 확인하고 일치하면 일치하는 문자열을 반환한다.(match object)
일치하는 문자열이 없으면 None을 반환한다.

import re

source = 'Lux, the Lady of Luminosity'

# 시작부분이 'Lux'인 값을 찾는다.
m = re.match('Lux', source)
result : 'Lux'

# 시작부분이 'Lady'인 값을 찾는다. 이 경우는 없으므로 결과는 None
m = re.match('Lady', source)
result : None

# 시작부분에 아무개의 문자가 있고 'Lady'가 나오는 값을 찾는다.
m = re.match('.*Lady', source)
result = 'Lux, the Lady'

search

패턴과 일치하는 첫번째 문자열을 반환한다.

m = re.search('Lady', source)

if m:
	print(m.group())

result : Lady

findall

패턴과 일치하는 모든 문자열을 리스트 형태로 반환한다.

>>> m = re.findall('y', source)
>>> m
['y', 'y']

>>> m = re.findall('y..', source)
>>> m
['y o']

split

문자열의 split()과 비슷하게 동작하지만 패턴을 사용할 수 있다.

>>> m = re.split('o', source)
>>> m
['Lux, the Lady ', 'f Lumin', 'sity']

sub

문자열의 replace()와 비슷하지만 패턴을 사용할 수 있다.

>>> m = re.sub('o', '!', source)
>>> m
'Lux, the Lady !f Lumin!sity'

정규표현식의 패턴 문자

패턴에 \를 써야한다면 패턴 시작에 꼭 r을 써주어야 한다.

패턴	문자
\d	숫자
\D	비숫자
\w	문자
\W	비문자
\s	공백 문자
\S	비공백 문자
\b	단어 경계 (\w와 \W의 경계)
\B	비단어 경계

import string
printable = string.printable

# 출력 가능한 모든 문자 출력
print(printable)

# 각 정규표현식의 패턴 문자 확인
print('\w', re.findall('\w', printable))
print('\W', re.findall('\W', printable))
print('\d', re.findall('\d', printable))
print('\D', re.findall('\D', printable))
print('\s', re.findall('\s', printable))
print('\S', re.findall(r'\S', printable))
print('\\b', re.findall(r'\b', printable))
print('\B', re.findall(r'\B', printable))

정규표현식의 패턴 지정자 (Pattern specifier)

expr은 정규표현식

패턴	의미
abc	리터럴 `abc`
(expr)	expr
expr1 \| expr2	expr1 또는 expr2
`.`	`\n`을 제외한 모든 문자
`^`	소스문자열의 시작
`$`	소스문자열의 끝
expr`?`	0 또는 1회의 expr
expr`*`	0회 이상의 최대 expr
expr`*?`	0회 이상의 최소 expr
expr`+`	1회 이상의 최대 expr
expr`+?`	1회 이상의 최소 expr
expr`{m}`	m회의 expr
expr`{m,n}`	m에서 n회의 최대 expr
expr`{m,n}?`	m에서 n회의 최소 expr
[abc]	a or b or c
[^abc]	not (a or b or c)
expr1(?=expr2)	뒤에 expr2가 오면 expr1에 해당하는 부분
expr1(?=expr2)	뒤에 expr2까 오지 않으면 expr1에 해당하는 부분
(?<=expr1)expr2	앞에 expr1이 오면 expr2에 해당하는 부분
(?<!expr1)expr2	앞에 expr1이 오지 않으면 expr2에 해당하는 부분

매칭 결과 그룹화

정규표현식 패턴 중, 괄호로 둘러쌓인 부분이 있을 경우, 결과는 해당 활호만의 그룹으로 지정된다.
groups() : 지정된 그룹의 리스트 반환
group() : 정규표현식이 가르키는 문자열
group(0) : 정규표현식이 가르키는 문자열
group(1) : 첫번째 매치된 그룹 요소
group(2) : 두번째 매치된 그룹 요소

s = re.search(r'\w+\w( was )', story)

# 지정된 그룹의 리스트 반환
print(s.groups())
--> (' was ',)

# 정규식이 가리키는 문자열 반환
print(s.group())
print(s.group(0))
--> Luxanna was

# 첫번째 괄호로 묶은 부분의 문자열
s.group(1)
--> was

그룹에 이름을 붙여서 사용 가능

(?P<그룹이름>정규표현식) 형태로 사용한다.

m = re.search(r'(?P<before>\w+)\s+(?P<was>was)\s+(?P<after>\w+)', story)

# 그룹으로 지정된 요소 모두를 반환
print(m.groups())
--> ('Luxanna', 'was', 'destined')

# 그룹이름 before의 매치데이터를 반환
print(m.group('before'))
--> Luxanna

# 그룹이름 was의 매치데이터를 반환
print(m.group('was'))
--> was

# 그룹이름 after의 매치데이터를 반환
print(m.group('after'))
--> destined

최대일치와 최소일치

* 나 + 에 최소 일치인 ? 를 붙여주면 정규식 다음부분에 해당하는 문자열이 처음 나올때 까지의 값을 반환한다.
? 가 없으면 정규식 다음부분에 해당하는 문자열이 마지막으로 나오는 부분까지를 반환한다.

Story

Born to the prestigious Crownguards, the paragon family of Demacian service, Luxanna was destined for greatness. She grew up as the family’s only daughter, and she immediately took to the advanced education and lavish parties required of families as high profile as the Crownguards. As Lux matured, it became clear that she was extraordinarily gifted. She could play tricks that made people believe they had seen things that did not actually exist. She could also hide in plain sight. Somehow, she was able to reverse engineer arcane magical spells after seeing them cast only once. She was hailed as a prodigy, drawing the affections of the Demacian government, military, and citizens alike.

As one of the youngest women to be tested by the College of Magic, she was discovered to possess a unique command over the powers of light. The young Lux viewed this as a great gift, something for her to embrace and use in the name of good. Realizing her unique skills, the Demacian military recruited and trained her in covert operations. She quickly became renowned for her daring achievements; the most dangerous of which found her deep in the chambers of the Noxian High Command. She extracted valuable inside information about the Noxus-Ionian conflict, earning her great favor with Demacians and Ionians alike. However, reconnaissance and surveillance was not for her. A light of her people, Lux’s true calling was the League of Legends, where she could follow in her brother’s footsteps and unleash her gifts as an inspiration for all of Demacia.

실습 1

{m}패턴지정자를 사용해서 a로 시작하는 4글자 단어를 전부 찾는다.

re.findall(r'(?<=\s)[Aa]\w{3}(?=\s)', story)

re.findall(r'\ba\w{3}\b', story)

실습 2

r로 끝나는 모든 단어를 찾는다.

re.findall(r'\w+?r[\b\W]', story)

re.findall(r'\w+r\b', story)

실습 3

a,b,c,d,e중 아무 문자나 3번 연속으로 들어간 단어를 찾는다.

re.findall(r'\w*?[abcde]{3}\w*?(?=\s)', story)

re.findall(r'\w*[abcde]{3}\w*', story)

실습 4 –확인!!!!

re.sub를 사용해서 ,로 구분된 앞/뒤 단어에 대해 앞단어는 대문자화 시키고, 뒷단어는 대괄호로 감싼다.
이 과정에서, 각각의 앞/뒤에 before, after그룹 이름을 사용한다.

def re_change(m):
    return '{}{}[{}]'.format(m.group('before').upper(), m.group('center'), m.group('after'))
    
p = re.compile(r'(?P<before>\w+)(?P<center>\s*,\s*)(?P<after>\w+)')

result = re.sub(p, re_change, story)

print(result)

# result = re.sub(p, '!!!\g<before>!!!\g<center>!!!\g<after>', story)
# result = re.sub(p, '{} {} {}]'.format('\g<before>.upper(), '\g<center>', '<\g<after>')

recordingbetter's devlog

Python, Django, DRF, Postgresql, AWS, Docker....