.. image:: https://img.shields.io/badge/People-Zi_Ran_Feng-red.svg
.. image:: https://img.shields.io/badge/People-Yu_Chen_Li-red.svg
.. image:: https://img.shields.io/badge/language-Python34-brightgreen.svg
.. image:: https://img.shields.io/badge/database-sqlite-brightgreen.svg
Outbrain data explore report
============================
- `About the data <https://www.kaggle.com/c/outbrain-click-prediction/data>`_.
- `github <https://github.com/MacHu-GWU/outbrain-project>`_, currently private.
Understand the data
-------------------
- **scale**: find out how many rows there are and how much disk space they take.
- **primary_key**: identify the primary objects, such as document, advertisement, event, user, etc.
- **cardinality**: find the number of unique values for each data point.
- **relationship**: find the relationship between each pair of data points; it could be one-to-one, one-to-many, or many-to-many.
- **data quality**: find how many values are missing.

We simply read the first 1000 rows of each csv file, count its total lines, and write everything into a single Excel file. Browsing the data is now easy. A follow-up sketch for the cardinality and data-quality items comes right after the code.
.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import numpy as np, pandas as pd
    from pathlib_mate import Path
    from sfm import lines_count
    from outbrain import config, database
    from outbrain.database import engine, metadata
    from outbrain.logger import logger

    def merge_everything_to_schema():
        """Summarize the schema of every csv file.

        Read the first 1000 rows of each file and count its total number of
        lines, then merge the information of all files into one Excel workbook
        so the schema is easy to browse.
        """
        logger.info("Run merge_everything_to_schema()")
        writer = pd.ExcelWriter("Schema.xlsx")
        # select all .csv files under the data directory
        for p in Path(config.data_dir).select_by_ext(".csv"):
            logger.info("process %s ..." % p.basename, 1)
            lines = lines_count.count_lines(p.abspath) - 1  # count data rows, excluding the header
            sheet_name = p.fname + "_%s" % lines
            df = pd.read_csv(p.abspath, nrows=1000)  # fetch the first 1000 rows
            df.to_excel(writer, sheet_name, index=False)  # one spreadsheet per csv file
        writer.save()  # write the workbook to disk

    merge_everything_to_schema()
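The schema workbook answers the **scale** question. The **cardinality** and **data quality** items can be estimated with the same kind of loop; below is a minimal sketch that is not part of the project scripts. It only samples the first million rows of each csv, because some of the full files do not fit in memory, so the numbers are approximations.

.. code-block:: python

    def estimate_cardinality_and_missing(n_rows=1000000):
        """For each csv, report the number of unique values and the number of
        missing values per column, estimated from the first ``n_rows`` rows.
        """
        for p in Path(config.data_dir).select_by_ext(".csv"):
            df = pd.read_csv(p.abspath, nrows=n_rows)
            report = pd.DataFrame({
                "cardinality": df.nunique(),     # unique values per column
                "n_missing": df.isnull().sum(),  # missing values per column
            })
            print(p.basename)
            print(report)

    estimate_cardinality_and_missing()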
Explore the data
----------------
We noticed that ``document_id`` is associated with ``ad_id`` and ``click_status`` via ``display_id`` in ``events.csv``, and that each document has ``category``, ``topic``, ``entity`` and other ``meta_data``.
We believe these relationships are very important, and we also want to take advantage of a database to speed up data selection. Here is some code to find the relationships out; a sketch of loading one of these tables into sqlite follows it.
.. code-block:: python

    def investigate_documents_categories_csv():
        """Find out whether document_id vs category_id is one-to-one,
        one-to-many or many-to-many.

        Conclusion: it is a many-to-many relationship.
        """
        p = Path("raw_data", "documents_categories.csv")
        df = pd.read_csv(p.abspath, usecols=["document_id", "category_id"])
        n_rows = df.shape[0]
        n_document_id = len(df["document_id"].unique())
        n_category_id = len(df["category_id"].unique())
        print(n_rows, n_document_id, n_category_id)  # 5481475 2828649 97

    investigate_documents_categories_csv()
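Because the project also keeps a sqlite database, one cheap way to take advantage of it is to bulk-load relationship tables such as ``documents_categories.csv`` so that later joins and lookups are indexed. The sketch below is illustrative rather than the project's actual loading code: it assumes the SQLAlchemy ``engine`` imported from ``outbrain.database`` points at that sqlite file, and the table and index names are made up.

.. code-block:: python

    def load_documents_categories_into_sqlite():
        """Bulk-load documents_categories.csv into sqlite in chunks, then add
        an index on document_id so per-document lookups are fast.
        """
        p = Path(config.data_dir, "documents_categories.csv")
        for chunk in pd.read_csv(p.abspath, chunksize=1000000):
            chunk.to_sql("documents_categories", engine,
                         if_exists="append", index=False)
        with engine.connect() as conn:
            conn.execute(
                "CREATE INDEX IF NOT EXISTS idx_doc_cat_document_id "
                "ON documents_categories (document_id)")

    load_documents_categories_into_sqlite()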
``ad_id`` exists in ``clicks_train.csv``, ``clicks_test.csv`` and ``promoted_content.csv``, and ``promoted_content.csv`` provides rich meta data. Is that meta data available for every ``ad_id`` in the train and test data? The script below tells us "YES"!
.. code-block:: python

    def investigate_ad_id():
        """Check whether every ad_id in clicks_train and clicks_test also
        appears in promoted_content.

        ad_id exists in all of clicks_train, clicks_test and promoted_content.
        In promoted_content the ad_id is unique, while in the other two files
        it is not, so we check whether the ad_id sets of clicks_train and
        clicks_test contain any element that promoted_content does not have.

        Conclusion: the ad_id in promoted_content covers all ad_id.
        """
        p = Path(config.data_dir, "clicks_train.csv")
        df_train = pd.read_csv(p.abspath, usecols=[1, ])
        p = Path(config.data_dir, "clicks_test.csv")
        df_test = pd.read_csv(p.abspath, usecols=[1, ])
        p = Path(config.data_dir, "promoted_content.csv")
        df_promoted = pd.read_csv(p.abspath, usecols=[0, ])
        # ad_id that appear in train or test but not in promoted_content
        s = set.union(set(df_train["ad_id"]), set(df_test["ad_id"])).\
            difference(set(df_promoted["ad_id"]))
        print(len(s))  # 0

    investigate_ad_id()
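The same style of check works for the ``display_id`` to ``document_id`` link in ``events.csv`` mentioned above. A small sketch, assuming the column names from the competition's data page; it prints the counts instead of asserting a conclusion:

.. code-block:: python

    def investigate_display_id_vs_document_id():
        """Count how many distinct document_id each display_id maps to in
        events.csv. If the maximum is 1, the relationship is many-to-one:
        many display_id can share a document, but each display_id has exactly
        one document.
        """
        p = Path(config.data_dir, "events.csv")
        df = pd.read_csv(p.abspath, usecols=["display_id", "document_id"])
        max_docs_per_display = df.groupby("display_id")["document_id"].nunique().max()
        print(df.shape[0], df["display_id"].nunique(), max_docs_per_display)

    investigate_display_id_vs_document_id()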
Baseline Model
--------------
``clicks_train.csv`` provides the naive probability that an ``ad_id`` will be clicked. How about, for any ``display_id``, ordering its ``ad_id`` by that probability? Let's give it a try:
.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import numpy as np, pandas as pd
    from pathlib_mate import Path
    from sfm.timer import Timer
    from sfm.pandas_mate import IOTool
    from outbrain import config, database
    from outbrain.logger import logger

    def naive_guess_based_on_probability():
        """
        Runtime: about 5 minutes.
        """
        table = dict()
        chunksize = 1000000

        with Timer():
            logger.info("calculate probability ...")
            p = Path(config.data_dir, "clicks_train.csv")
            # count how often each ad_id is displayed and how often it is clicked
            for _, ad_id, click in IOTool.csv_tuple_iterator(p.abspath, chunksize):
                if click:
                    try:
                        table[ad_id]["all"] += 1
                        table[ad_id]["clicked"] += 1
                    except KeyError:
                        table[ad_id] = {"all": 1, "clicked": 1}
                else:
                    try:
                        table[ad_id]["all"] += 1
                    except KeyError:
                        table[ad_id] = {"all": 1, "clicked": 0}
            for _, d in table.items():
                d["prob"] = d["clicked"] * 1.0 / d["all"]

        with Timer():
            logger.info("group by display_id ...")
            display = dict()
            p = Path(config.data_dir, "clicks_test.csv")
            for display_id, ad_id in IOTool.csv_tuple_iterator(p.abspath, chunksize):
                try:
                    display[display_id].append(ad_id)
                except KeyError:
                    display[display_id] = [ad_id, ]

        with Timer():
            logger.info("choose best ad_id ...")
            result = list()
            for display_id, ad_id_list in display.items():
                # look up the click probability of each candidate; unseen ad_id defaults to 0.0
                prob_table = [(ad_id, table.get(ad_id, {"prob": 0.0})["prob"]) for ad_id in ad_id_list]
                best_ad_id, best_prob = sorted(prob_table, key=lambda x: x[1], reverse=True)[0]
                result.append((
                    display_id,
                    best_ad_id,
                    best_prob,
                    ", ".join([str(ad_id) for ad_id in ad_id_list])
                ))
            df = pd.DataFrame(result, columns=["display_id", "best_ad_id", "best_prob", "all_ad_id"])

        with Timer():
            logger.info("export result ...")
            df.to_csv("result.csv", index=False)

    naive_guess_based_on_probability()
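The script above only keeps the single best ``ad_id`` per ``display_id``. The competition instead expects, for each ``display_id``, all of its candidate ``ad_id`` separated by spaces and ordered by how likely they are to be clicked. Below is a minimal sketch of that last step, assuming the ``table`` and ``display`` dicts built above are kept around (e.g. returned from ``naive_guess_based_on_probability``); the output file name is arbitrary.

.. code-block:: python

    def write_submission(table, display, path="submission.csv"):
        """Write one row per display_id, with its candidate ad_id sorted by
        estimated click probability, highest first.
        """
        with open(path, "w") as f:
            f.write("display_id,ad_id\n")
            for display_id, ad_id_list in display.items():
                ranked = sorted(
                    ad_id_list,
                    key=lambda ad_id: table.get(ad_id, {"prob": 0.0})["prob"],
                    reverse=True,
                )
                f.write("%s,%s\n" % (display_id, " ".join(str(a) for a in ranked)))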