[Pandas] Adding methods to the Pandas API with pandas_flavor

Category: 딥상어동의 딥한 데이터 처리 / 전처리 (Preprocessing) | 2022. 3. 13. 03:06 | @딥상어동의 딥한생각


I wrote this for the Python - Data Engineering study group at 가짜연구소 (PseudoLab).

https://www.notion.so/chanrankim/PseudoLab-c42db6652c1b45c3ba4bfe157c70cf09

(가짜연구소 / PseudoLab link)

https://www.notion.so/chanrankim/Data-Engineer-Python-83c206a662004120a8211a800581e124

(study group details link)

0. Why use pandas

https://qiita.com/alokrawat050/items/f807d193d1e677f6916f

Why do we use pandas? I couldn't easily define it myself, so I looked for articles on the topic.

Less writing and more work done
https://data-flair.training/blogs/advantages-of-python-pandas/

That phrase resonated with me the most. Pandas code is short. For example, let's compare a group-by operation with SQL.

data.groupby('group1')['value1'].mean()

In pandas, one line does it.

select group1, avg(value1) as avg_value1
from table1
group by group1

SQL needs a bit more. Of course, each language exists for its own reasons, so I'm not saying one is better or worse. I just wanted to stress that pandas makes the code shorter.

1. Method chaining

One reason pandas code ends up short is, I think, method chaining.

data_raw.groupby(['group1'])['value1'].mean().reset_index(drop=True).head()

Here four operations run back to back: one that builds the groups, one that computes the mean per group, one that resets the index, and one that takes the first 5 rows. All four fit on a single line.

Why does this work?

print(type(data_raw.groupby(['일자'])['계(명)']))
print(type(data_raw.groupby(['일자'])['계(명)'].mean()))
print(type(data_raw.groupby(['일자'])['계(명)'].mean().reset_index()))
print(type(data_raw.groupby(['일자'])['계(명)'].mean().reset_index().head()))

<class 'pandas.core.groupby.generic.SeriesGroupBy'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

Calling type() after each step of the chain gives SeriesGroupBy, Series, DataFrame, DataFrame, in that order. Now let's look at the official pandas documentation.

pandas.core.groupby.GroupBy.mean
Returns: pandas.Series or pandas.DataFrame

pandas.DataFrame.reset_index
Returns: DataFrame or None

pandas.DataFrame.head
Returns: same type as caller. The first n rows of the caller object.

Look at the return types. Each method returns a pandas Series or DataFrame. head() returns the same type as the object it was called on. In other words, pandas methods keep handing back Series/DataFrame objects, and that is why the next method can be called on the result.
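To make the point about return types concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration; only the types at each step matter.

```python
import pandas as pd

# Toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "group1": ["a", "a", "b", "b"],
    "value1": [1.0, 3.0, 10.0, 20.0],
})

grouped = toy.groupby("group1")["value1"]  # SeriesGroupBy
means = grouped.mean()                     # Series indexed by group1
flat = means.reset_index()                 # DataFrame with group1 back as a column
top = flat.head()                          # DataFrame again

# Each step hands back an object that has the next method on it.
for step in (grouped, means, flat, top):
    print(type(step).__name__)
```

Every intermediate value is a pandas object, so the four assignments can be collapsed into one chained expression without changing the result.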

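This may seem like it came out of nowhere, but I couldn't think of a better section title, so I just wrote whatever came to mind. So far, the point has been that pandas method chaining is nice.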

But couldn't I make my own chainable pandas method too?

2. Classes and instances

Before that, a quick detour through classes and instances.

class A:

    def __init__(self, val1, val2):
        self.val1 = val1
        self.val2 = val2

    def add(self):
        return self.val1 + self.val2

    def subtract(self):
        return self.val1 - self.val2


b = A(1, 2)
b.add()       # 3
b.subtract()  # -1

Above, we call class A and assign the result to the variable b. b is now an instance of class A. Because b is an instance of A (no inheritance is involved here), it can use the methods add and subtract defined in A.

A pandas DataFrame works the same way. The moment we write data = pd.DataFrame(), data becomes an instance of the DataFrame class. That is why we can use the many methods defined on pd.DataFrame.

But what if I want to add a function of my own? How would I do that?
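In plain Python, one answer is to attach a function to the class itself. Methods are looked up on the class at call time, so even an instance created earlier picks up the new method. The sketch below reuses the toy class A; multiply is a hypothetical extra method, not something from the original example.

```python
class A:
    def __init__(self, val1, val2):
        self.val1 = val1
        self.val2 = val2

    def add(self):
        return self.val1 + self.val2


b = A(1, 2)  # the instance exists before the new method does


def multiply(self):
    """Hypothetical extra method, defined outside the class."""
    return self.val1 * self.val2


# Attach the function to the class, not to the instance.
A.multiply = multiply

print(b.multiply())  # 2
```

This is the basic mechanism that the rest of the post builds on: extending DataFrame means putting something new on the DataFrame class.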

3. Pandas-flavor

pandas-flavor makes this easy. No questions asked, let's just build one first.

https://gibles-deepmind.tistory.com/103

[Exponential functions] - Why COVID-19 case counts startle us


In that earlier post I shot my mouth off about COVID-19 data without really knowing much. Sourcing data is a real chore, so I'll reuse the same dataset.

import pandas as pd
import numpy as np
import pandas_flavor as pf

# COVID-19 data
data_raw = pd.read_csv("https://raw.githubusercontent.com/GiblesDeepMind"
                       "/deepPythonAnalysis/master/interpretation/covid19_korea.csv"
                       , encoding='cp949', parse_dates=['일자'])

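The data is on GitHub, so pointing at the URL is all it takes. (PyCharm auto-formatted the code, so the URL is split across two lines. If another editor raises an error, remove the inner quotes and join the two pieces into one string.)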

The data has one row per date with the number of confirmed cases. With this data, and without further ado, let's add a pandas method.

Getting an aggregate over the n days after a given date

# Import the required libraries
import pandas as pd
import numpy as np
import pandas_flavor as pf
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Add a method to the pandas DataFrame object
@pf.register_dataframe_method
def idx_after_nday(df: pd.DataFrame, start_date: str, nday: int):
    """Select the rows from start_date through nday days later.

    Slices by label, so df is assumed to have a sorted DatetimeIndex.
    """
    end_date = datetime.strptime(start_date, '%Y-%m-%d') + relativedelta(days=nday)
    return df.loc[start_date:end_date]

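I've written it, so let's run it. The function slices by index label, so the date column is set as the index first.

data_raw.set_index('일자').idx_after_nday('2020-02-01', 10)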

It works nicely.

data_raw.set_index('일자').idx_after_nday('2020-02-01', 10)['계(명)'].sum()
>> 17

The aggregation is short too.
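One detail worth checking is the label slice inside idx_after_nday. On a DatetimeIndex, .loc includes both endpoints, so nday=10 returns 11 rows. The sketch below shows this on a toy frame without pandas_flavor; the column name count and the dates are assumptions for illustration.

```python
from datetime import datetime

import pandas as pd
from dateutil.relativedelta import relativedelta

# Toy frame with a daily DatetimeIndex; "count" is a made-up column.
toy = pd.DataFrame(
    {"count": range(30)},
    index=pd.date_range("2020-01-25", periods=30, freq="D"),
)

start_date = "2020-02-01"
end_date = datetime.strptime(start_date, "%Y-%m-%d") + relativedelta(days=10)

window = toy.loc[start_date:end_date]  # label slice: both ends are included
print(len(window))                     # 11
print(window.index.min().date(), window.index.max().date())
```

If exactly n rows are wanted, one option is to pass nday - 1; whether that is the right choice depends on how "n days after" should be read.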

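Honestly, all this digging was so that I could understand decorators. If you just want to extend the pandas API for fun, knowing this much is plenty. Since I'm a coding novice, the rest is a write-up of how it works, mostly for my own understanding.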

4. How it works

The conclusion first: here is what @pf.register_dataframe_method does.

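It adds the function defined under the decorator to the pandas DataFrame class as a method,
for the lifetime of the current Python session. From then on, every instance of the DataFrame class
can use the method I declared myself.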

How does it work? For DataFrames, pandas-flavor offers two main functions: register_dataframe_accessor and register_dataframe_method.

register_dataframe_accessor
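Registers an accessor class on DataFrame under a name of your choice. The accessor acts as a namespace (similar to the built-in .str or .dt on Series), and you can hang many functions off it.
(See the official docs linked below for details.)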

register_dataframe_method
Adds a function directly to the pandas object, with no accessor name in between.

register_dataframe_accessor is class-like: it lets you group several functions under one accessor. register_dataframe_method adds functions one at a time. This post uses register_dataframe_method; if you are curious about accessors, have a look at the official site.
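For comparison, here is a minimal sketch of the accessor style, using the pandas function pd.api.extensions.register_dataframe_accessor rather than pandas-flavor. The accessor name demo and its two methods are made up for illustration.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")  # "demo" is a made-up name
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj  # the DataFrame the accessor was reached from

    def n_missing(self):
        """Total number of missing cells."""
        return int(self._obj.isna().sum().sum())

    def numeric_cols(self):
        """Names of the numeric columns."""
        return list(self._obj.select_dtypes("number").columns)


toy = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", None]})
print(toy.demo.n_missing())     # 2
print(toy.demo.numeric_cols())  # ['x']
```

Both functions live under toy.demo, which keeps them out of the main DataFrame namespace. With the method style, each function would sit directly on the DataFrame instead.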

# https://github.com/Zsailer/pandas_flavor/blob/master/pandas_flavor/register.py

def register_dataframe_method(method):
    """Register a function as a method attached to the Pandas DataFrame.
    Example
    -------
    .. code-block:: python
        @register_dataframe_method
        def print_column(df, col):
            '''Print the dataframe column given'''
            print(df[col])
    """
    def inner(*args, **kwargs):

        class AccessorMethod(object):


            def __init__(self, pandas_obj):
                self._obj = pandas_obj

            @wraps(method)
            def __call__(self, *args, **kwargs):
                return method(self._obj, *args, **kwargs)

        register_dataframe_accessor(method.__name__)(AccessorMethod)

        return method

    return inner()

The code above is copied as-is from the pandas-flavor source. I'm a coding novice, but let's go through it piece by piece.

1. AccessorMethod(object)

1-1. def __init__(self, pandas_obj):
__init__ is the class's initializer. In one line: the class cannot be instantiated without a pandas_obj, and the DataFrame passed in is stored as self._obj.

1-2. def __call__(self, *args, **kwargs):
return method(self._obj, *args, **kwargs)
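__call__ lets an instance of the class be called like a function. When an AccessorMethod instance is called, it forwards the call to method, passing the stored DataFrame as the first argument.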
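The __init__ / __call__ pair can be tried out without pandas. In this minimal sketch, describe is a hypothetical stand-in for the decorated function and a plain list stands in for the DataFrame.

```python
from functools import wraps


def describe(obj, prefix):
    """Hypothetical function whose first argument is the wrapped object."""
    return f"{prefix}: {obj!r}"


class AccessorMethod:
    def __init__(self, wrapped_obj):
        self._obj = wrapped_obj  # remember the object we were built from

    @wraps(describe)
    def __call__(self, *args, **kwargs):
        # Forward the call, passing the stored object as the first argument.
        return describe(self._obj, *args, **kwargs)


caller = AccessorMethod([1, 2, 3])
print(caller("data"))  # data: [1, 2, 3]
```

The caller never passes the list itself. The instance supplies it, which is the same trick that lets df.some_method(...) skip the df argument.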

register_dataframe_accessor(method.__name__)(AccessorMethod)
So, who is this guy?

To find out, we need to look at the code behind pandas.api.extensions.

# https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/accessor.py#L192

@doc(klass="", others="")
def _register_accessor(name, cls):

    def decorator(accessor):
        if hasattr(cls, name):
            warnings.warn(
                f"registration of accessor {repr(accessor)} under name "
                f"{repr(name)} for type {repr(cls)} is overriding a preexisting "
                f"attribute with the same name.",
                UserWarning,
                stacklevel=find_stack_level(),
            )
        setattr(cls, name, CachedAccessor(name, accessor))
        cls._accessors.add(name)
        return accessor

    return decorator


@doc(_register_accessor, klass="DataFrame")
def register_dataframe_accessor(name):
    from pandas import DataFrame

    return _register_accessor(name, DataFrame)

register_dataframe_accessor(method.__name__)(AccessorMethod)
1. _register_accessor first takes a name, which here is the method.__name__ from earlier. It also takes cls, the class to extend, which register_dataframe_accessor fixes to DataFrame.

register_dataframe_accessor(method.__name__)(AccessorMethod)
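2. Next, AccessorMethod is passed in as the accessor argument of the inner decorator function, and setattr attaches it to DataFrame under that name. Because __call__ in register_dataframe_method returns the result of method(...), registering AccessorMethod this way lets us call the method function by its own name on a DataFrame.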
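The two steps can be reproduced in plain Python. The sketch below is a simplified stand-in, not the pandas implementation: SimpleAccessor plays the role of CachedAccessor without the caching, Frame is a hypothetical toy container in place of DataFrame, and total is a made-up function to register.

```python
class SimpleAccessor:
    """Simplified stand-in for pandas' CachedAccessor (no caching)."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner):
        if obj is None:  # accessed on the class itself, e.g. Frame.total
            return self._accessor_cls
        return self._accessor_cls(obj)  # accessed on an instance


def register_accessor(name, cls):
    def decorator(accessor_cls):
        setattr(cls, name, SimpleAccessor(name, accessor_cls))
        return accessor_cls
    return decorator


class Frame:
    """Hypothetical toy container standing in for DataFrame."""

    def __init__(self, values):
        self.values = values


def total(frame, scale=1):
    return sum(frame.values) * scale


class AccessorMethod:
    def __init__(self, frame):
        self._obj = frame

    def __call__(self, *args, **kwargs):
        return total(self._obj, *args, **kwargs)


# Step 1 fixes the name and the target class; step 2 hands over AccessorMethod.
register_accessor(total.__name__, Frame)(AccessorMethod)

print(Frame([1, 2, 3]).total())         # 6
print(Frame([1, 2, 3]).total(scale=2))  # 12
```

Attribute access on an instance goes through __get__, which builds an AccessorMethod around that instance. The parentheses that follow then trigger __call__.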

# Sample function
def sample1(a:int):
    def sample2(b:int):
        return a + b
    return sample2


sample1(3)(3) 
>> 6

Wrapping one function in another like this is what makes the double parentheses possible. (For reference.)
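The same double call is what happens when a decorator takes an argument. In this minimal sketch, tag is a hypothetical decorator factory; the @ form and the explicit double call do the same thing.

```python
def tag(label):
    """Hypothetical decorator factory: the first call receives the argument."""
    def decorator(func):
        # The second call receives the function being decorated.
        func.label = label
        return func
    return decorator


@tag("demo")
def hello():
    return "hi"


def hello_again():
    return "hi"


hello_again = tag("demo")(hello_again)  # same thing without the @ syntax

print(hello.label, hello_again.label)  # demo demo
```

register_dataframe_accessor(method.__name__)(AccessorMethod) follows this pattern: the first call fixes the name, the second receives the class to register.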

Ref.

https://recoveryman.tistory.com/368

Method Chaining (메서드 체인)

https://pypi.org/project/pandas-flavor/
pandas-flavor: The easy way to write your own Pandas flavor.

https://morioh.com/p/dde9c061e98d
Easiest How to Write Your Own Flavor Of Pandas

https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/accessor.py#L154
GitHub - pandas-dev/pandas: Flexible and powerful data analysis / manipulation library for Python

https://github.com/Zsailer/pandas_flavor/tree/master/pandas_flavor
GitHub - Zsailer/pandas_flavor: The easy way to write your own flavor of Pandas
