MacHu-GWU
12/9/2016 - 10:51 PM

outbrain-data-explore-report.rst

.. image:: https://img.shields.io/badge/People-Zi_Ran_Feng-red.svg
.. image:: https://img.shields.io/badge/People-Yu_Chen_Li-red.svg
.. image:: https://img.shields.io/badge/language-Python34-brightgreen.svg
.. image:: https://img.shields.io/badge/database-sqlite-brightgreen.svg


Outbrain data explore report
============================

- `About the data <https://www.kaggle.com/c/outbrain-click-prediction/data>`_.
- `GitHub <https://github.com/MacHu-GWU/outbrain-project>`_, currently private.

Understand the data
-------------------
- **scale**: find how many rows each file has and how much disk space it costs.
- **primary_key**: identify the primary objects, such as document, advertisement, event, user, etc.
- **cardinality**: find the number of unique values for each data point.
- **relationship**: find the relationship between each pair of data points; it can be one-to-one, one-to-many, or many-to-many (see the sketch after this list).
- **data quality**: find how many values are missing.
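
For the **relationship** and **data quality** checks, here is a minimal pandas sketch (the function names and the ``left``/``right`` column arguments are placeholders of our own, not part of the project code):

.. code-block:: python

	import pandas as pd

	def classify_relationship(df, left, right):
	    """Classify the relationship between two columns."""
	    # max number of distinct right values per left value, and vice versa
	    left_fanout = df.groupby(left)[right].nunique().max()
	    right_fanout = df.groupby(right)[left].nunique().max()
	    if left_fanout == 1 and right_fanout == 1:
	        return "one-to-one"
	    if left_fanout == 1 or right_fanout == 1:
	        return "one-to-many"
	    return "many-to-many"

	def count_missing(df):
	    """Count missing values per column."""
	    return df.isnull().sum()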

We simply read the first 1000 rows of each csv file, count its total lines, and write everything into a single Excel file. Now browsing the data is easy.

.. code-block:: python

	#!/usr/bin/env python
	# -*- coding: utf-8 -*-

	import numpy as np, pandas as pd
	from pathlib_mate import Path
	from sfm import lines_count
	from outbrain import config, database
	from outbrain.database import engine, metadata
	from outbrain.logger import logger

	def merge_everything_to_schema():
	    """Summarize the schema of all csv files.
	    
	    For each file, read the first 1000 rows and count the total number
	    of lines, then merge the information from every file into a single
	    Excel workbook, so the schema is easy to browse.
	    """
	    logger.info("Run merge_everything_to_schema()")
	    writer = pd.ExcelWriter("Schema.xlsx")
	    # select all .csv files
	    for p in Path(config.data_dir).select_by_ext(".csv"):
	        logger.info("process %s ..." % p.basename, 1)
	        lines = lines_count.count_lines(p.abspath) - 1 # number of data rows, excluding the header
	        sheet_name = p.fname + "_%s" % lines
	        df = pd.read_csv(p.abspath, nrows=1000) # fetch the first 1000 rows
	        df.to_excel(writer, sheet_name, index=False) # one spreadsheet per file

	    writer.save() # write to excel

	merge_everything_to_schema()

Explore the data
----------------
We noticed that ``document_id`` is associated with ``ad_id`` and ``click_status`` via ``display_id`` in ``events.csv``. Each document also has ``category``, ``topic``, ``entity`` and other metadata.
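
For example, joining ``events.csv`` to ``clicks_train.csv`` on ``display_id`` links each ``document_id`` to its ``ad_id`` and click status. A quick sketch (sampling the first 100000 rows of each file to keep it light; column names follow the Kaggle data page):

.. code-block:: python

	import pandas as pd
	from pathlib_mate import Path
	from outbrain import config

	events = pd.read_csv(
	    Path(config.data_dir, "events.csv").abspath,
	    usecols=["display_id", "document_id"], nrows=100000)
	clicks = pd.read_csv(
	    Path(config.data_dir, "clicks_train.csv").abspath, nrows=100000)
	# each row: which ad was shown on which document, and whether it was clicked
	linked = events.merge(clicks, on="display_id")
	print(linked.head())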

We believe these relationships are very important, and we also want to take advantage of a database to speed up data selection. Here is some code to find out.

.. code-block:: python

	def investigate_documents_categories_csv():
	    """Find out whether document_id vs category_id is many-to-many.
	    
	    Check the relationship between document_id and category_id.
	    
	    Conclusion: it is a many-to-many relationship.
	    """
	    p = Path(config.data_dir, "documents_categories.csv")
	    df = pd.read_csv(p.abspath, usecols=["document_id", "category_id"])
	    n_rows = df.shape[0]
	    n_document_id = len(df["document_id"].unique())
	    n_category_id = len(df["category_id"].unique())
	    # 5481475 rows but only 2828649 documents and 97 categories:
	    # documents repeat and categories repeat, hence many-to-many
	    print(n_rows, n_document_id, n_category_id) # 5481475 2828649 97

	investigate_documents_categories_csv()
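
Since lookups like this get repeated, it pays to load the csv files into the sqlite database once and query them there. A sketch, assuming ``outbrain.database.engine`` is the SQLAlchemy engine imported earlier:

.. code-block:: python

	import pandas as pd
	from pathlib_mate import Path
	from outbrain import config
	from outbrain.database import engine

	# load documents_categories.csv into sqlite in chunks, so that
	# indexed SQL queries can replace repeated full-file scans
	p = Path(config.data_dir, "documents_categories.csv")
	for chunk in pd.read_csv(p.abspath, chunksize=1000000):
	    chunk.to_sql("documents_categories", engine,
	                 if_exists="append", index=False)

	# example query: count distinct documents without re-reading the csv
	df = pd.read_sql(
	    "SELECT COUNT(DISTINCT document_id) FROM documents_categories",
	    engine)
	print(df)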

``ad_id`` exists in ``clicks_train.csv``, ``clicks_test.csv`` and ``promoted_content.csv``, and ``promoted_content.csv`` provides rich metadata. Is that metadata available for every ``ad_id`` in the train and test data? The script below tells us "YES"!

.. code-block:: python

	def investigate_ad_id():
	    """Check that promoted_content.csv covers every ad_id.
	    
	    ad_id appears in all three files: clicks_train, clicks_test and
	    promoted_content. In promoted_content ad_id is unique, while in the
	    other two files it is not. So we check whether the set of ad_id in
	    clicks_train and clicks_test has any element that is missing from
	    promoted_content.
	    
	    Conclusion: the ad_id in promoted_content covers all ad_id.
	    """
	    p = Path(config.data_dir, "clicks_train.csv")
	    df_train = pd.read_csv(p.abspath, usecols=[1,])
	    
	    p = Path(config.data_dir, "clicks_test.csv")
	    df_test = pd.read_csv(p.abspath, usecols=[1,])
	    
	    p = Path(config.data_dir, "promoted_content.csv")
	    df_promoted = pd.read_csv(p.abspath, usecols=[0,])
	    
	    s = set.union(set(df_train["ad_id"]), set(df_test["ad_id"])).\
	        difference(set(df_promoted["ad_id"]))
	        
	    print(len(s)) # 0
	    
	investigate_ad_id()


Baseline Model
--------------
``clicks_train.csv`` gives a naive estimate of the probability that each ``ad_id`` is clicked. How about, for each ``display_id``, we order its ``ad_id`` by that probability? Let's have a try:

.. code-block:: python

	#!/usr/bin/env python
	# -*- coding: utf-8 -*-

	import numpy as np, pandas as pd
	from pathlib_mate import Path
	from sfm.timer import Timer
	from sfm.pandas_mate import IOTool
	from outbrain import config, database
	from outbrain.logger import logger


	def naive_guess_based_on_probability():
	    """

	    Runtime 5 minutes.
	    """
	    table = dict()
	    chunksize = 1000000
	    
	    with Timer():
	        logger.info("calculate probability ...")
	        p = Path(config.data_dir, "clicks_train.csv")
	        for _, ad_id, click in IOTool.csv_tuple_iterator(p.abspath, chunksize):
	            try:
	                d = table[ad_id]
	            except KeyError:
	                d = table[ad_id] = {"all": 0, "clicked": 0}
	            d["all"] += 1 # count every impression
	            if int(click): # int() handles the click flag whether it arrives as int or str
	                d["clicked"] += 1 # count clicks
	    
	    for _, d in table.items():
	        d["prob"] = d["clicked"] * 1.0 / d["all"]
	    
	    with Timer():
	        logger.info("group by display_id ...")
	        display = dict()
	        p = Path(config.data_dir, "clicks_test.csv")
	        for display_id, ad_id in IOTool.csv_tuple_iterator(p.abspath, chunksize):
	            try:
	                display[display_id].append(ad_id)
	            except KeyError:
	                display[display_id] = [ad_id,]
	    
	    with Timer():
	        logger.info("choose best ad_id ...")
	        result = list()
	        for display_id, ad_id_list in display.items():
	            prob_table = [(ad_id, table.get(ad_id, {"prob": 0.0})["prob"]) for ad_id in ad_id_list]
	            best_ad_id, best_prob = max(prob_table, key=lambda x: x[1])
	            result.append((
	                display_id, 
	                best_ad_id, 
	                best_prob, 
	                ", ".join([str(ad_id) for ad_id in ad_id_list])
	            ))
	            
	        df = pd.DataFrame(result, columns=["display_id", "best_ad_id", "best_prob", "all_ad_id"])
	    
	    with Timer():
	        logger.info("export result ...")
	        df.to_csv("result.csv", index=False)
	        
	naive_guess_based_on_probability()
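
Note that the script above keeps only the single best ``ad_id`` per ``display_id``. To order *every* ``ad_id`` by its probability, as described above, the last step can be extended as follows (a sketch reusing the ``table`` and ``display`` dicts and the ``pd`` import from the script; the ranked, space-separated format matches the competition's submission file):

.. code-block:: python

	def rank_ad_id_by_probability(table, display):
	    """For each display_id, sort its ad_id list by estimated
	    click probability, highest first.
	    """
	    rows = list()
	    for display_id, ad_id_list in display.items():
	        ranked = sorted(
	            ad_id_list,
	            key=lambda ad_id: table.get(ad_id, {"prob": 0.0})["prob"],
	            reverse=True)
	        rows.append((display_id, " ".join(str(ad_id) for ad_id in ranked)))
	    return pd.DataFrame(rows, columns=["display_id", "ad_id"])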