[Python][Polars] Data cleaning in practice with Polars

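Preparing the DataFrame

Since I went to the trouble of creating a CSV file in the previous post, I will use it as an example and clean it with Polars. The CSV file before cleaning is here. Whether you use Polars or pandas, the best way to learn this kind of data cleaning is to look at real data and work it into the shape you want and need.

Importing the required libraries

Import the libraries needed for the data cleaning.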

import polars as pl
import numpy as np
import datetime
import os
import math

Reading the CSV file

Read the file; change the directory to suit your environment. Calling glimpse() gives a quick overview of the contents.

#Read CSV
df = pl.read_csv("/Users/Documents/merged.csv")
df.glimpse()

Rows: 24259
Columns: 11
$ 年度             <i64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999
$ 大会             <str> ゼロックススーパーカップ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, Ｊ１ １ｓｔ, サテライトリーグ Ｂグループ
$ 節              <str> None, 第１節第１日, 第１節第１日, 第１節第１日, 第１節第１日, 第１節第１日, 第１節第１日, 第１節第１日, 第１節第１日, 第１日
$ 試合日            <str> 02/27(土), 03/06(土), 03/06(土), 03/06(土), 03/06(土), 03/06(土), 03/06(土), 03/06(土), 03/06(土), 03/07(日)
$ K/O時刻          <str> 14:52, 14:03, 15:03, 15:03, 16:03, 16:04, 16:04, 16:07, 19:04, 14:02
$ ホーム            <str> 鹿島, 名古屋, 浦和, Ｃ大阪, 磐田, 鹿島, 清水, 横浜FM, 柏, Ｖ川崎
$ スコア            <str> 2-1, 1-0, 2-1, 2-1, 3-1, 4-0, 3-1, 2-0, 1-0, 2-2
$ アウェイ           <str> 清水, 福岡, Ｇ大阪, 広島, 神戸, 市原, Ｖ川崎, 平塚, 京都, 磐田
$ スタジアム          <str> 国立, 瑞穂陸, 駒場, 長居, 磐田, 国立, 日本平, 横浜国, 柏, よみうり
$ 入場者数           <str> 28,520, 18,543, 18,639, 6,697, 13,349, 19,531, 11,169, 36,214, 9,018, 352
$ インターネット中継・TV放送 <str> 日本テレビ（生）, 名古屋テレビ／ＢＳ（生）, テレビ埼玉／Ｊ－ＳＰＯＲＴＳ, ディレクＴＶ（録）, 静岡放送, ＢＳ（録）／ディレクＴＶ, テレビ静岡／スカイパーフェクＴＶ（録）, ＢＳ（生）, ＢＳ（生）, None

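We will now tidy this up so that it is easier to analyze.

Sorting and renaming

First, sort by the year in which the match was played and rename the columns to something easier to work with. Personally, I usually give columns English rather than Japanese names, partly to avoid garbled characters and partly because they are easier to type.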

#Sort by year
df = df.sort("年度")
#Rename
df = df.rename({"年度":"year", "大会":"tournament", "節":"section", "試合日":"day",
            "K/O時刻": "time", "ホーム":"home", "スコア":"score", "アウェイ":"away",
            "スタジアム":"stadium", "入場者数":"No_visitors",
            "インターネット中継・TV放送": "Internet_TV_broadcasting"})

Renaming is done with

df.rename()

Pass it a dictionary that maps each old column name to its new name.

Type conversion

Converting between string and integer types is an important part of data cleaning.

#year
    #convert from int to str
df = df.with_columns(pl.col("year").cast(str).alias("year"))

In pandas, type conversion is done with "astype"; in Polars it is "cast". For the list of types, see the link below.

Data types — Polars documentation

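One point to note: in Polars, a column is normally modified by passing an expression to "with_columns" and assigning the result back to the same variable, so that the DataFrame replaces itself.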

Flagging rows that contain a given string

For example, to know whether a match is a J.League J1 match, you can set a flag depending on whether the string contains "Ｊ１", as follows.

#tournament
df = df.with_columns(pl.when(pl.col("tournament").str.contains("Ｊ１")).then(1).otherwise(0).alias("J1"))

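Extracting only the desired values from a string

This is where things get a little more involved. The section is written with full-width digits mixed in, as in "第１節第１日" (section 1, day 1). First, extract just the numbers and store them in new columns named "section2_1" and "section2_2".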

#Section
df = df.with_columns(pl.col("section").str.extract(r"第(\d+)節").alias("section2_1"))
df = df.with_columns(pl.col("section").str.extract(r"第(\d+)日").alias("section2_2"))

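This extracts the section number and the day within the section, so the next step is to replace the full-width digits with half-width ones. This part is not elegant: I wanted to write clean code using a regular expression, but gave up.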

#Replace each full-width digit with its half-width equivalent in both columns
zenkaku_to_hankaku = dict(zip("０１２３４５６７８９", "0123456789"))
for col_name in ["section2_1", "section2_2"]:
    for zenkaku, hankaku in zenkaku_to_hankaku.items():
        df = df.with_columns(pl.col(col_name).str.replace_all(zenkaku, hankaku).alias(col_name))

Both columns now contain half-width digits. Convert them from strings to integers.

df = df.with_columns(pl.col("section2_1").cast(pl.Int32).alias("section2_1"))
df = df.with_columns(pl.col("section2_2").cast(pl.Int32).alias("section2_2"))

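Date information

Dates are in formats such as "02/07(土)" or "10/22(金・祝)". First, extract the month/day and the day of the week. Some rows contain "未定" (undecided) instead of a date, presumably matches that were never played and were left in the data. These would cause errors in later type conversions, so remove them.

As a note to myself: checks for missing and anomalous values like this should really be done at the very beginning. If you proceed with data cleaning without noticing anomalous values, you are very likely to run into errors, and they can distort the results of the analysis.

I also want a public-holiday flag, so str.contains is used to set rows containing "・祝" to 1 and all others to 0.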

#day
df = df.with_columns(pl.col("day").str.slice(0, 5).alias("day2_1"))
df = df.with_columns(pl.col("day").str.slice(6, 1).alias("day2_2"))
#Remove rows whose date is "未定" (undecided)
df = df.filter((pl.col("day2_1") != "未定"))

df = df.with_columns(pl.when(pl.col("day").str.contains("・祝")).then(1).otherwise(0).alias("nholiday"))
df["nholiday"].sum()#1655

Finally, replace the Japanese day-of-week characters with English names.


df = df.with_columns(pl.col("day2_2").str.replace("日", "Sunday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("月", "Monday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("火", "Tuesday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("水", "Wednesday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("木", "Thursday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("金", "Friday").alias("day2_2"))
df = df.with_columns(pl.col("day2_2").str.replace("土", "Saturday").alias("day2_2"))

Creating the date

Having cleaned the date fields, let's also create a proper Polars date column.

#time
df = df.with_columns((pl.col("year")+"/"+pl.col("day2_1")).alias("Date"))

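#The format is passed positionally because its keyword name differs between Polars releases (fmt in older ones, format in newer ones)
df = df.with_columns(pl.col("Date").str.strptime(pl.Date, "%Y/%m/%d"))

str.strptime parses the string according to the given format and already returns a pl.Date column, so no additional cast is needed.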

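Splitting the score and converting it to integers

Scores look like "1-0" or "2-3(PK3-4)". The number on the left (1 in the first example) is the home team's score and the number on the right (0 in the first example) is the away team's score, so regular expressions are used to separate them.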

#score
df.filter(pl.col("score").str.contains("10"))
df.select(pl.col("score")).unique()

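The first line lists the matches whose score contains "10", such as "9-10", that is, matches in which a team scored 10 or more; the number of rows in the result gives their count. The second line lists the unique values. To get the number of unique values, use "n_unique()".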

df = df.with_columns(pl.col("score").str.replace("vs", "0").alias("score"))
df = df.with_columns(pl.col("score").str.replace("中止", "0").alias("score"))

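Matches that were cancelled or have not yet been played are recorded as "中止" (cancelled) or "vs", so these are replaced with "0". I actually wanted to replace them with null, but did not know how.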

    #home_score
df = df.with_columns(pl.col("score").str.extract(r"^(\d+)", 1).alias("home_score"))  #leading digits are the home score; avoids a Python lambda
df = df.with_columns(pl.col("home_score").cast(pl.Int32).alias("home_score"))
    #away_score
df = df.with_columns(pl.col("score").str.extract(r"-(\d+)").alias("away_score"))
df = df.with_columns(pl.col("away_score").cast(pl.Int32).alias("away_score"))
    #PK
df = df.with_columns(pl.when(pl.col("score").str.contains("PK")).then(1).otherwise(0).alias("PK"))

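For a score such as "2-3", the regular expressions take "2" as the home team's score and "3" as the away team's score. For penalty shoot-outs, only a flag indicating that the match went to penalties is set. This is because if the data is used, for example, to predict sports lottery results, the information about which team won the shoot-out is not needed (the prediction is a three-way choice among "home team wins in 90 minutes", "other", and "home team loses in 90 minutes").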

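Converting attendance to integers, and the broadcast flag

The rest in one go. Attendance is stored as strings such as "12,345", so the commas are removed and the column is cast to an integer type. Whether the match was broadcast is flagged as 0 or 1.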

#No_visitors
    #get rid of comma in No_visitors
df = df.with_columns(pl.col("No_visitors").str.replace_all(",", ""))  #replace_all also handles values with more than one comma
    #convert from str to int
df = df.with_columns(pl.col("No_visitors").cast(pl.Int32).alias("No_visitors"))

#Internet_TV_broadcasting
df = df.with_columns(pl.when(pl.col("Internet_TV_broadcasting").is_null()).then(0).otherwise(1).alias("Internet_TV_broadcasting"))
df.select(pl.col("Internet_TV_broadcasting")).unique()
df.select(pl.col("Internet_TV_broadcasting")).n_unique()

Final shaping

Finally, select only the columns that are needed to form the final DataFrame.

df = df.select(
    [
    pl.col("Date"),
    pl.col("day2_2").alias("day_of_the_week"),
    pl.col("nholiday"),
    pl.col("section2_1").alias("section"),
    pl.col("section2_2").alias("section_day"),
    pl.col("home"), 
    pl.col("home_score"),
    pl.col("away"),
    pl.col("away_score"),
    pl.col("PK"),
    pl.col("stadium"), 
    pl.col("No_visitors"),
    pl.col("Internet_TV_broadcasting"),
    pl.col("J1")
    ]
)
