[python][Polars] About Polars, a new DataFrame library for Python

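What is Polars?

A DataFrame library built on Rust. It is faster than pandas in the official benchmarks and is intuitive to use. As the examples below show, you basically chain operations together, much like dplyr in R. For speed, see the benchmark results published in the official documentation.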

Install

Install it with pip as usual.

pip install polars

Import it as follows. A few other libraries are imported as well for later convenience.

import polars as pl

# to enrich the examples in this quickstart with dates
from datetime import datetime, timedelta 
# to generate data for the examples
import numpy as np

Creating objects

A Series is built from a Python list or tuple.

# with a tuple (the first argument is the name of the Series)
series = pl.Series("a", (1, 2, 3, 4, 5))

print(series)

# with a list
series = pl.Series([1, 2, 3, 4, 5])

print(series)

A DataFrame is built from a dict, as in pandas.

dataframe = pl.DataFrame({"integer": [1, 2, 3], 
                          "date": [
                              (datetime(2022, 1, 1)), 
                              (datetime(2022, 1, 2)), 
                              (datetime(2022, 1, 3))
                          ], 
                          "float":[4.0, 5.0, 6.0]})

print(dataframe)

If you are asked to install "pyarrow" at this point, install it with pip.

Reading CSV and other files

This is also almost the same as in pandas.

# write to CSV
dataframe.write_csv('output.csv')

# read from CSV
df_csv = pl.read_csv('output.csv')

print(df_csv)

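At this point the date column has been read in as strings, so specify the following.

# try_parse_dates was named parse_dates in older Polars releases
df_csv_with_dates = pl.read_csv('output.csv', try_parse_dates=True)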

print(df_csv_with_dates)

JSON and Parquet work in much the same way.

Inspecting a DataFrame

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pl.DataFrame({"a": np.arange(0, 8), 
                   "b": np.random.rand(8), 
                   "c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
                   "d": [1, 2.0, np.nan, np.nan, 0, -5, -42, None]
                  })

print(df)

Use head(5) to see the first 5 rows and tail(5) to see the last 5 rows.

df.head(5)
df.tail(5)

To look at a few random rows, use sample.

df.sample(n = 3)

describe gives a summary of the statistics.

df.describe()

Expressions

This is the core feature of Polars. Expressions are mainly used in three contexts: "select", "filter", and "with_columns".

select

To select only specific columns, specify them as follows.

df.select(
    pl.col(['a', 'b'])
)

Passing a list of expressions, as below, also works.

# in this example we limit the number of rows returned to 3, as the comparison is clear.
# this also shows how easy we can extend our expression to what we need. 
df.select([
    pl.col('a'),
    pl.col('b')
]).limit(3)

To show everything except certain columns, use exclude.

df.select([
    pl.exclude('a')
])

Filter

filter extracts the rows that satisfy a condition.

df.filter(
    pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)),
)

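The next example is fairly complex, but a condition like this can also be written. Each comparison is wrapped in parentheses and the two are combined with &.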

df.filter(
    (pl.col('a') <= 3) & (pl.col('d').is_not_nan())
)

with_columns

Used to create new columns. In the example below, the sum of "b" is stored in a new column "e", and "b" plus 42 is stored in a column named "b+42".

df.with_columns([
    pl.col('b').sum().alias('e'),
    (pl.col('b') + 42).alias('b+42')
])

Groupby

Builds a summary per group. The examples below use df2, a different DataFrame from the one used so far.

df2 = pl.DataFrame({
                    "x": np.arange(0, 8), 
                    "y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],
})

print(df2)

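# group_by was spelled groupby before Polars 0.19, and len() replaces the older count()
df2.group_by("y", maintain_order=True).len()

The example above shows the row count per group, but the following is also possible.

df2.group_by("y", maintain_order=True).agg([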
    pl.col("*").count().alias("count"),
    pl.col("*").sum().alias("sum")
])

Combining operations

Of course, with_columns, filter, and the others can also be combined.

# create a new column that multiplies column `a` and `b` from our DataFrame
# select all the columns, but exclude column `c` and `d` from the final DataFrame

df_x = df.with_columns(
    (pl.col("a") * pl.col("b")).alias("a * b")
).select([
    pl.all().exclude(['c', 'd'])
])

print(df_x)

# only excluding column `d` in this example

df_y = df.with_columns([
    (pl.col("a") * pl.col("b")).alias("a * b")
]).select([
    pl.all().exclude('d')
])

print(df_y)

Joining DataFrames

This part is also much the same as in pandas.

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pl.DataFrame({"a": np.arange(0, 8), 
                   "b": np.random.rand(8), 
                   "c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
                   "d": [1, 2.0, np.nan, np.nan, 0, -5, -42, None]
                  })

df2 = pl.DataFrame({
                    "x": np.arange(0, 8), 
                    "y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],
})