EDA: EMT dataset 🚦

INESDATA-MOV

Análisis EDA¶

En este cuaderno se realiza el análisis de datos exploratorio (EDA) del dataset de la EMT. Una vez realizado el análisis de calidad se estudiará para el dataset resultante:

Valores nulos y unicidad de las variables
Correlaciones entre las variables
Correlaciones entre las variables y la variable de estudio estimateArrive

Con ello se pretende hacer un filtrado de las variables que no dispongan de información relevante para el modelo o variables con la misma información. Por otro lado se hará un estudio de los valores nulos para, en caso de que sea pertinente, reconstruir estos datos.

In [1]:

Copied!





import os
import statistics
import random
import polars as pl
import pandas as pd
import seaborn as sns
import polars.selectors as cs
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

import warnings

warnings.filterwarnings("ignore")
import os
import statistics
import random
import polars as pl
import pandas as pd
import seaborn as sns
import polars.selectors as cs
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

import warnings

warnings.filterwarnings("ignore")

In [2]:

Copied!

ROOT_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd()))))
DATA_PATH = os.path.join(ROOT_PATH, "data", "processed")
EMT_DATA_PATH = os.path.join(DATA_PATH, "emt")
ROOT_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd()))))
DATA_PATH = os.path.join(ROOT_PATH, "data", "processed")
EMT_DATA_PATH = os.path.join(DATA_PATH, "emt")

Dataset¶

In [3]:

Copied!

sample_data = pl.scan_csv(os.path.join(EMT_DATA_PATH, "2024", "03","13", f"emt_20240313.csv"))
sample_data = pl.scan_csv(os.path.join(EMT_DATA_PATH, "2024", "03","13", f"emt_20240313.csv"))

In [4]:

Copied!

sample_data.collect()
sample_data.collect()

Out[4]:

shape: (1_134_462, 19)

date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	deviation	StartTime	StopTime	MinimunFrequency	MaximumFrequency	isHead	dayType	strike	estimateArrive
str	str	i64	str	i64	f64	f64	i64	i64	str	i64	str	str	i64	i64	bool	str	str	i64
"2024-03-13"	"2024-03-13 08:…	513	"27"	56	-3.690542	40.423739	0	1841	"PLAZA CASTILLA…	0	"05:55"	"23:30"	3	12	false	"LA"	"N"	473
"2024-03-13"	"2024-03-13 08:…	521	"27"	56	-3.689019	40.429011	0	1221	"PLAZA CASTILLA…	0	"05:55"	"23:30"	3	12	false	"LA"	"N"	313
"2024-03-13"	"2024-03-13 08:…	2549	"150"	56	-3.691608	40.421366	0	2081	"VIRGEN CORTIJO…	0	null	null	null	null	false	null	null	547
"2024-03-13"	"2024-03-13 08:…	2561	"150"	56	-3.698411	40.418013	0	2779	"VIRGEN CORTIJO…	0	null	null	null	null	false	null	null	1080
"2024-03-13"	"2024-03-13 08:…	5571	"14"	56	-3.687918	40.43285	0	679	"PIO XII"	0	null	null	null	null	false	null	null	197
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"2024-03-13"	"2024-03-13 22:…	2141	"174"	51023	-3.667048	40.489136	0	8114	"VALDEBEBAS"	0	"06:00"	"23:45"	7	22	false	"LA"	"N"	1225
"2024-03-13"	"2024-03-13 22:…	8810	"171"	51023	0.0	0.0	0	13290	"VALDEBEBAS"	0	null	null	null	null	false	null	null	2163
"2024-03-13"	"2024-03-13 22:…	8829	"171"	51023	0.0	0.0	0	2088	"VALDEBEBAS"	0	null	null	null	null	false	null	null	396
"2024-03-13"	"2024-03-13 22:…	2290	"174"	3256	-3.623092	40.48233	0	13847	"VALDEBEBAS"	0	"06:00"	"23:45"	7	22	false	"LA"	"N"	1960
"2024-03-13"	"2024-03-13 22:…	8879	"174"	3256	-3.676137	40.46962	0	4891	"VALDEBEBAS"	0	"06:00"	"23:45"	7	22	false	"LA"	"N"	746

Variables numéricas¶

In [5]:

Copied!

sample_data.select(cs.numeric()).head().collect()
sample_data.select(cs.numeric()).head().collect()

Out[5]:

shape: (5, 10)

bus	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	deviation	MinimunFrequency	MaximumFrequency	estimateArrive
i64	i64	f64	f64	i64	i64	i64	i64	i64	i64
513	56	-3.690542	40.423739	0	1841	0	3	12	473
521	56	-3.689019	40.429011	0	1221	0	3	12	313
2549	56	-3.691608	40.421366	0	2081	0	null	null	547
2561	56	-3.698411	40.418013	0	2779	0	null	null	1080
5571	56	-3.687918	40.43285	0	679	0	null	null	197

Variables categóricas¶

In [6]:

Copied!

sample_data.select(cs.string()).head().collect()
sample_data.select(cs.string()).head().collect()

Out[6]:

shape: (5, 8)

date	datetime	line	destination	StartTime	StopTime	dayType	strike
str	str	str	str	str	str	str	str
"2024-03-13"	"2024-03-13 08:…	"27"	"PLAZA CASTILLA…	"05:55"	"23:30"	"LA"	"N"
"2024-03-13"	"2024-03-13 08:…	"27"	"PLAZA CASTILLA…	"05:55"	"23:30"	"LA"	"N"
"2024-03-13"	"2024-03-13 08:…	"150"	"VIRGEN CORTIJO…	null	null	null	null
"2024-03-13"	"2024-03-13 08:…	"150"	"VIRGEN CORTIJO…	null	null	null	null
"2024-03-13"	"2024-03-13 08:…	"14"	"PIO XII"	null	null	null	null

Preprocesamiento previo¶

Creamos la PK¶

In [7]:

Copied!

sample_data = sample_data.with_columns((pl.col('datetime').cast(pl.String)+"_B"+pl.col('bus').cast(pl.String)+"_L"+ pl.col('line').cast(pl.String)+"_S"+pl.col('stop').cast(pl.String)).alias('PK'))
sample_data.head().collect()
sample_data = sample_data.with_columns((pl.col('datetime').cast(pl.String)+"_B"+pl.col('bus').cast(pl.String)+"_L"+ pl.col('line').cast(pl.String)+"_S"+pl.col('stop').cast(pl.String)).alias('PK'))
sample_data.head().collect()

Out[7]:

shape: (5, 20)

date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	deviation	StartTime	StopTime	MinimunFrequency	MaximumFrequency	isHead	dayType	strike	estimateArrive	PK
str	str	i64	str	i64	f64	f64	i64	i64	str	i64	str	str	i64	i64	bool	str	str	i64	str
"2024-03-13"	"2024-03-13 08:…	513	"27"	56	-3.690542	40.423739	0	1841	"PLAZA CASTILLA…	0	"05:55"	"23:30"	3	12	false	"LA"	"N"	473	"2024-03-13 08:…
"2024-03-13"	"2024-03-13 08:…	521	"27"	56	-3.689019	40.429011	0	1221	"PLAZA CASTILLA…	0	"05:55"	"23:30"	3	12	false	"LA"	"N"	313	"2024-03-13 08:…
"2024-03-13"	"2024-03-13 08:…	2549	"150"	56	-3.691608	40.421366	0	2081	"VIRGEN CORTIJO…	0	null	null	null	null	false	null	null	547	"2024-03-13 08:…
"2024-03-13"	"2024-03-13 08:…	2561	"150"	56	-3.698411	40.418013	0	2779	"VIRGEN CORTIJO…	0	null	null	null	null	false	null	null	1080	"2024-03-13 08:…
"2024-03-13"	"2024-03-13 08:…	5571	"14"	56	-3.687918	40.43285	0	679	"PIO XII"	0	null	null	null	null	false	null	null	197	"2024-03-13 08:…

PK no única

In [8]:

Copied!

sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()
sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()

Out[8]:

shape: (5, 2)

PK	count
str	u32
"2024-03-13 22:…	2
"2024-03-13 22:…	2
"2024-03-13 22:…	2
"2024-03-13 22:…	2
"2024-03-13 22:…	2

Eliminamos los valores incorrectos `estimateArrive`¶

In [9]:

Copied!

sample_data.group_by(pl.col('estimateArrive')).count().sort('estimateArrive',descending=True).head().collect()
sample_data.group_by(pl.col('estimateArrive')).count().sort('estimateArrive',descending=True).head().collect()

Out[9]:

shape: (5, 2)

estimateArrive	count
i64	u32
999999	2558
888888	753
5389	1
5343	1
5339	1

Filtramos por un ETA < 88888 ya que es un valor de tiempo erróneo

In [10]:

Copied!

sample_data = sample_data.filter(pl.col('estimateArrive')<888888)
sample_data = sample_data.filter(pl.col('estimateArrive')<888888)

Comprobamos que la PK es única

In [11]:

Copied!

sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()
sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()

Out[11]:

shape: (5, 2)

PK	count
str	u32
"2024-03-13 22:…	2
"2024-03-13 10:…	2
"2024-03-13 10:…	2
"2024-03-13 10:…	2
"2024-03-13 10:…	2

Sigue habiendo datos duplicados, por lo que para esos casos, cogeremos el que tenga menor ETA

In [12]:

Copied!

sample_data = sample_data.group_by('PK').min()
sample_data = sample_data.group_by('PK').min()

In [13]:

Copied!

sample_data.group_by('PK').count().sort('count',descending=True).head().collect()
sample_data.group_by('PK').count().sort('count',descending=True).head().collect()

Out[13]:

shape: (5, 2)

PK	count
str	u32
"2024-03-13 15:…	1
"2024-03-13 17:…	1
"2024-03-13 17:…	1
"2024-03-13 12:…	1
"2024-03-13 14:…	1

Matriz de confusión¶

Pasamos datetime a formato fecha

In [14]:

Copied!

sample_data = sample_data.with_columns(pl.col('datetime').map_elements(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f")))
sample_data = sample_data.with_columns(pl.col('datetime').map_elements(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f")))

Pasamos date a formato fecha

In [15]:

Copied!

sample_data = sample_data.with_columns(pl.col("date").cast(pl.Date))
sample_data = sample_data.with_columns(pl.col("date").cast(pl.Date))

Pasamos la variable isHead a numérica

In [16]:

Copied!

sample_data = sample_data.with_columns(pl.col('isHead').cast(pl.UInt8))
sample_data = sample_data.with_columns(pl.col('isHead').cast(pl.UInt8))

In [17]:

Copied!

sample_data.head().collect()
sample_data.head().collect()

Out[17]:

shape: (5, 20)

PK	date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	deviation	StartTime	StopTime	MinimunFrequency	MaximumFrequency	isHead	dayType	strike	estimateArrive
str	date	datetime[μs]	i64	str	i64	f64	f64	i64	i64	str	i64	str	str	i64	i64	u8	str	str	i64
"2024-03-13 13:…	2024-03-13	2024-03-13 13:57:02.921708	5575	"14"	54	-3.692133	40.420172	0	2227	"PIO XII"	0	null	null	null	null	0	null	null	844
"2024-03-13 21:…	2024-03-13	2024-03-13 21:53:03.331814	5667	"40"	52	-3.697028	40.432398	0	4060	"ALFONSO XIII"	0	null	null	null	null	0	null	null	1406
"2024-03-13 13:…	2024-03-13	2024-03-13 13:41:03.386037	4793	"45"	72	-3.694743	40.392751	0	2869	"REINA VICTORIA…	0	null	null	null	null	0	null	null	1085
"2024-03-13 13:…	2024-03-13	2024-03-13 13:56:54.781069	521	"27"	54	-3.689548	40.43714	0	612	"PLAZA CASTILLA…	0	"05:55"	"23:30"	3	12	0	"LA"	"N"	194
"2024-03-13 18:…	2024-03-13	2024-03-13 18:24:05.996058	2468	"134"	4968	-3.713937	40.483179	0	1207	"MONTECARMELO"	0	"06:25"	"23:45"	9	26	0	"LA"	"N"	283

In [18]:

Copied!





numeric_var = ['bus', 'line','stop',
 'positionBusLon',
 'positionBusLat',
 'DistanceBus',
 'deviation',
 'MinimunFrequency',
 'MaximumFrequency',
 'isHead',
 'estimateArrive',]
numeric_var = ['bus', 'line','stop',
 'positionBusLon',
 'positionBusLat',
 'DistanceBus',
 'deviation',
 'MinimunFrequency',
 'MaximumFrequency',
 'isHead',
 'estimateArrive',]

In [19]:

Copied!

sample_data_df = sample_data.collect()
sample_data_df = sample_data.collect()

In [20]:

Copied!





matrix = [[0] * len(numeric_var) for _ in range(len(numeric_var))]
for i in range(0,len(numeric_var)):
    for j in range(0,len(numeric_var)):
        matrix[i][j] = sample_data_df.select(pl.corr(numeric_var[i],numeric_var[j])).item()
matrix = [[0] * len(numeric_var) for _ in range(len(numeric_var))]
for i in range(0,len(numeric_var)):
    for j in range(0,len(numeric_var)):
        matrix[i][j] = sample_data_df.select(pl.corr(numeric_var[i],numeric_var[j])).item()

In [21]:

Copied!

matrix_corr = pd.DataFrame(matrix, columns=numeric_var,index=numeric_var).apply(lambda col: col.round(2))
matrix_corr = pd.DataFrame(matrix, columns=numeric_var,index=numeric_var).apply(lambda col: col.round(2))

In [22]:

Copied!

matrix_corr
matrix_corr

Out[22]:

	bus	line	stop	positionBusLon	positionBusLat	DistanceBus	deviation	MinimunFrequency	MaximumFrequency	isHead	estimateArrive
bus	1.00	0.04	0.11	0.27	-0.27	0.00	0.09	0.10	0.28	0.01	0.07
line	0.04	1.00	0.24	-0.08	0.09	0.27	0.01	0.43	0.39	0.08	0.12
stop	0.11	0.24	1.00	0.08	-0.07	0.13	-0.02	0.01	0.07	0.26	0.06
positionBusLon	0.27	-0.08	0.08	1.00	-1.00	0.12	-0.01	0.02	-0.02	-0.02	0.11
positionBusLat	-0.27	0.09	-0.07	-1.00	1.00	-0.11	0.01	0.02	0.02	0.02	-0.10
DistanceBus	0.00	0.27	0.13	0.12	-0.11	1.00	0.00	0.24	0.22	-0.06	0.86
deviation	0.09	0.01	-0.02	-0.01	0.01	0.00	1.00	-0.00	-0.00	-0.02	0.02
MinimunFrequency	0.10	0.43	0.01	0.02	0.02	0.24	-0.00	1.00	0.78	0.03	0.30
MaximumFrequency	0.28	0.39	0.07	-0.02	0.02	0.22	-0.00	0.78	1.00	0.04	0.26
isHead	0.01	0.08	0.26	-0.02	0.02	-0.06	-0.02	0.03	0.04	1.00	0.02
estimateArrive	0.07	0.12	0.06	0.11	-0.10	0.86	0.02	0.30	0.26	0.02	1.00

In [23]:

Copied!

sns.heatmap(matrix_corr, cmap='coolwarm',vmin=-1, vmax=1,annot= True)
sns.heatmap(matrix_corr, cmap='coolwarm',vmin=-1, vmax=1,annot= True)

Out[23]:

<Axes: >

No description has been provided for this image

In [24]:

Copied!

matrix_corr.to_csv('/home/mlia/proyectos/data-generation/docs/notebooks/aux/matrix_corr_emt.csv', index=False)
matrix_corr.to_csv('/home/mlia/proyectos/data-generation/docs/notebooks/aux/matrix_corr_emt.csv', index=False)

Obtenemos que las variables más correladas entre sí son:

positionBusLon y positionBusLat
MinimunFrequency y MaximumFrequency

Además la variable más correlada con estimateArrive es DistanceBus

Estudio por variables¶

Para cada variable:

Comprobaremos si tiene valores nulos
Calcularemos la correlación con la variable estimateArrive
Dibujaremos el tiempo medio de llegada según sus categorías en el caso en el que fuera posible
Decidiremos si la mantenemos o la eliminamos

In [25]:

Copied!

sample_data.head().describe()
sample_data.head().describe()

Out[25]:

shape: (9, 21)

statistic	PK	date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	deviation	StartTime	StopTime	MinimunFrequency	MaximumFrequency	isHead	dayType	strike	estimateArrive
str	str	str	str	f64	str	f64	f64	f64	f64	f64	str	f64	str	str	f64	f64	f64	str	str	f64
"count"	"5"	"5"	"5"	5.0	"5"	5.0	5.0	5.0	5.0	5.0	"5"	5.0	"4"	"4"	4.0	4.0	5.0	"4"	"4"	5.0
"null_count"	"0"	"0"	"0"	0.0	"0"	0.0	0.0	0.0	0.0	0.0	"0"	0.0	"1"	"1"	1.0	1.0	0.0	"1"	"1"	0.0
"mean"	null	"2024-03-13"	null	2826.6	null	2255.0	-3.679584	40.464248	0.0	3542.0	null	0.0	null	null	5.75	20.25	0.0	null	null	742.2
"std"	null	null	null	3510.668711	null	2304.700631	0.015201	0.034255	0.0	2000.683633	null	0.0	null	null	1.258306	2.362908	0.0	null	null	544.644104
"min"	"2024-03-13 11:…	"2024-03-13"	null	122.0	"174"	78.0	-3.702037	40.405106	0.0	339.0	"ALSACIA"	0.0	"05:30"	"23:45"	4.0	17.0	0.0	"LA"	"N"	76.0
"25%"	null	"2024-03-13"	null	584.0	null	246.0	-3.685782	40.464995	0.0	3195.0	null	0.0	null	null	6.0	20.0	0.0	null	null	281.0
"50%"	null	"2024-03-13"	null	2071.0	null	1762.0	-3.678519	40.478297	0.0	3891.0	null	0.0	null	null	6.0	22.0	0.0	null	null	860.0
"75%"	null	"2024-03-13"	null	2506.0	null	3794.0	-3.666058	40.483716	0.0	4703.0	null	0.0	null	null	6.0	22.0	0.0	null	null	1190.0
"max"	"2024-03-13 22:…	"2024-03-13"	null	8850.0	"C03"	5395.0	-3.665525	40.489127	0.0	5582.0	"VALDEBEBAS"	0.0	"06:00"	"23:45"	7.0	22.0	0.0	"LA"	"N"	1304.0

Variable date

Valores nulos

In [26]:

Copied!

sample_data.group_by(pl.col('date')).count().head().collect()
sample_data.group_by(pl.col('date')).count().head().collect()

Out[26]:

shape: (1, 2)

date	count
date	u32
2024-03-13	1131139

Variable datetime

Valores nulos

In [27]:

Copied!

sample_data.group_by(pl.col('datetime')).count().head().collect()
sample_data.group_by(pl.col('datetime')).count().head().collect()

Out[27]:

shape: (5, 2)

datetime	count
datetime[μs]	u32
2024-03-13 11:37:57.321267	2
2024-03-13 13:26:54.208612	12
2024-03-13 09:59:29.140672	4
2024-03-13 10:52:58.664917	4
2024-03-13 21:01:54.748638	2

Variable bus

Valores nulos

In [28]:

Copied!

sample_data.group_by(pl.col('bus')).count().head().collect()
sample_data.group_by(pl.col('bus')).count().head().collect()

Out[28]:

shape: (5, 2)

bus	count
i64	u32
3317	1260
4329	59
2266	515
4758	17
548	229

Correlación

In [29]:

Copied!

sample_data.select(pl.corr('bus','estimateArrive')).head().collect()
sample_data.select(pl.corr('bus','estimateArrive')).head().collect()

Out[29]:

shape: (1, 1)

bus
f64
0.06748

Tiempo medio de espera según autobús

In [30]:

Copied!

sample_data.group_by(pl.col('bus')).mean().select(pl.col('bus'),pl.col('estimateArrive')).head().collect()
sample_data.group_by(pl.col('bus')).mean().select(pl.col('bus'),pl.col('estimateArrive')).head().collect()

Out[30]:

shape: (5, 2)

bus	estimateArrive
i64	f64
4755	398.504202
2504	1082.137097
5654	380.661765
560	442.948276
563	528.829897

Según que autobús sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creación de nuestro modelo.

Variable line

Valores nulos

In [31]:

Copied!

sample_data.group_by(pl.col('line')).count().head().collect()
sample_data.group_by(pl.col('line')).count().head().collect()

Out[31]:

shape: (5, 2)

line	count
str	u32
"167"	3352
"67"	46138
"48"	13481
"173"	23933
"C2"	1650

Correlación

In [32]:

Copied!

sample_data.select(pl.corr('line','estimateArrive')).head().collect()
sample_data.select(pl.corr('line','estimateArrive')).head().collect()

Out[32]:

shape: (1, 1)

line
f64
0.120876

Tiempo medio de espera según línea

In [33]:

Copied!

sample_data.group_by(pl.col('line')).mean().select(pl.col('line'),pl.col('estimateArrive')).collect().plot.bar(x='line',rot=90)
sample_data.group_by(pl.col('line')).mean().select(pl.col('line'),pl.col('estimateArrive')).collect().plot.bar(x='line',rot=90)

Out[33]:

Según que línea sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creación de nuestro modelo.

Variable stop

Valores nulos

In [34]:

Copied!

sample_data.group_by(pl.col('stop')).count().head().collect()
sample_data.group_by(pl.col('stop')).count().head().collect()

Out[34]:

shape: (5, 2)

stop	count
i64	u32
1745	3446
5919	3435
5803	1707
30	6801
271	1721

Correlación

In [35]:

Copied!

sample_data.select(pl.corr('stop','estimateArrive')).head().collect()
sample_data.select(pl.corr('stop','estimateArrive')).head().collect()

Out[35]:

shape: (1, 1)

stop
f64
0.063389

Tiempo medio de espera según parada

In [36]:

Copied!

sample_data.group_by(pl.col('stop')).mean().select(pl.col('stop'),pl.col('estimateArrive')).head().collect()
sample_data.group_by(pl.col('stop')).mean().select(pl.col('stop'),pl.col('estimateArrive')).head().collect()

Out[36]:

shape: (5, 2)

stop	estimateArrive
i64	f64
5800	804.162099
4493	711.560164
1608	605.882045
5919	773.480058
5803	786.485647

Según que parada sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creación de nuestro modelo.

Variable positionBusLon

Valores nulos

In [37]:

Copied!

sample_data.group_by(pl.col('positionBusLon')).count().head().collect()
sample_data.group_by(pl.col('positionBusLon')).count().head().collect()

Out[37]:

shape: (5, 2)

positionBusLon	count
f64	u32
-3.697979	3
-3.647153	1
-3.6347	7
-3.711168	2
-3.713401	23

Correlación

In [38]:

Copied!

sample_data.select(pl.corr('positionBusLon','estimateArrive')).head().collect()
sample_data.select(pl.corr('positionBusLon','estimateArrive')).head().collect()

Out[38]:

shape: (1, 1)

positionBusLon
f64
0.10522

Variable positionBusLat

Valores nulos

In [39]:

Copied!

sample_data.group_by(pl.col('positionBusLat')).count().head().collect()
sample_data.group_by(pl.col('positionBusLat')).count().head().collect()

Out[39]:

shape: (5, 2)

positionBusLat	count
f64	u32
40.481868	9
40.467639	17
40.424176	2
40.469551	24
40.475646	9

Correlación

In [40]:

Copied!

sample_data.select(pl.corr('positionBusLat','estimateArrive')).head().collect()
sample_data.select(pl.corr('positionBusLat','estimateArrive')).head().collect()

Out[40]:

shape: (1, 1)

positionBusLat
f64
-0.103452

Variable positionTypeBus

Valores nulos

In [41]:

Copied!

sample_data.group_by(pl.col('positionTypeBus')).count().head().collect()
sample_data.group_by(pl.col('positionTypeBus')).count().head().collect()

Out[41]:

shape: (3, 2)

positionTypeBus	count
i64	u32
1	11
5	1356
0	1129772

Correlación

In [42]:

Copied!

sample_data.select(pl.corr('positionTypeBus','estimateArrive')).head().collect()
sample_data.select(pl.corr('positionTypeBus','estimateArrive')).head().collect()

Out[42]:

shape: (1, 1)

positionTypeBus
f64
0.028544

Como no tenemos información acerca del significado de esta variable, no podemos entenderla por lo que decidimos eliminarla

In [43]:

Copied!

sample_data = sample_data.drop('positionBusType')
sample_data = sample_data.drop('positionBusType')

Variable DistanceBus

Valores nulos

In [44]:

Copied!

sample_data.group_by(pl.col('DistanceBus')).count().head().collect()
sample_data.group_by(pl.col('DistanceBus')).count().head().collect()

Out[44]:

shape: (5, 2)

DistanceBus	count
i64	u32
10888	3
5651	48
6443	37
2385	248
670	284

Correlación

In [45]:

Copied!

sample_data.select(pl.corr('DistanceBus','estimateArrive')).head().collect()
sample_data.select(pl.corr('DistanceBus','estimateArrive')).head().collect()

Out[45]:

shape: (1, 1)

DistanceBus
f64
0.857959

Como era de esperar, es la variable que mayor correlación tiene con el ETA

Variable destination

Valores nulos

In [46]:

Copied!

sample_data.group_by(pl.col('destination')).count().head().collect()
sample_data.group_by(pl.col('destination')).count().head().collect()

Out[46]:

shape: (5, 2)

destination	count
str	u32
"TELEFONICA"	5120
"SANCHINARRO"	37472
"BARRIO DEL PIL…	54330
"PLAZA CATALUÑA…	1672
"REINA VICTORIA…	18282

Tiempo medio de espera según destino

In [47]:

Copied!

sample_data.group_by(pl.col('destination')).mean().select(pl.col('destination'),pl.col('estimateArrive')).collect().plot.bar(x='destination',rot=90)
sample_data.group_by(pl.col('destination')).mean().select(pl.col('destination'),pl.col('estimateArrive')).collect().plot.bar(x='destination',rot=90)

Out[47]:

Según que destino sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creación de nuestro modelo.

Variable deviation

Valores nulos

In [48]:

Copied!

sample_data.group_by(pl.col('deviation')).count().head().collect()
sample_data.group_by(pl.col('deviation')).count().head().collect()

Out[48]:

shape: (5, 2)

deviation	count
i64	u32
137	6
0	1123941
5261	3092
537	38
8585	4044

Correlación

In [49]:

Copied!

sample_data.select(pl.corr('deviation','estimateArrive')).head().collect()
sample_data.select(pl.corr('deviation','estimateArrive')).head().collect()

Out[49]:

shape: (1, 1)

deviation
f64
0.022565

In [50]:

Copied!

sample_data.group_by(pl.col('deviation')).mean().select(pl.col('deviation'),pl.col('estimateArrive')).sort('deviation', descending=True).collect()
sample_data.group_by(pl.col('deviation')).mean().select(pl.col('deviation'),pl.col('estimateArrive')).sort('deviation', descending=True).collect()

Out[50]:

shape: (7, 2)

deviation	estimateArrive
i64	f64
8585	799.088279
5261	711.674968
1259	367.625
537	463.5
288	517.5
137	512.833333
0	637.801457

No tenemos información acerca del significado de esta variable. Por lo que la eliminamos también

In [51]:

Copied!

sample_data = sample_data.drop('deviation')
sample_data = sample_data.drop('deviation')

Variable StartTime

Valores nulos

In [52]:

Copied!

sample_data.group_by(pl.col('StartTime')).count().sort(pl.col('StartTime'),descending=True).collect()
sample_data.group_by(pl.col('StartTime')).count().sort(pl.col('StartTime'),descending=True).collect()

Out[52]:

shape: (9, 2)

StartTime	count
str	u32
null	533127
"06:30"	13514
"06:25"	54669
"06:20"	23954
"06:15"	41060
"06:10"	61758
"06:00"	306036
"05:55"	47464
"05:30"	49557

Variable StopTime

Valores nulos

In [53]:

Copied!

sample_data.group_by(pl.col('StopTime')).count().sort(pl.col('StopTime'),descending=True).collect()
sample_data.group_by(pl.col('StopTime')).count().sort(pl.col('StopTime'),descending=True).collect()

Out[53]:

shape: (3, 2)

StopTime	count
str	u32
null	533127
"23:45"	526594
"23:30"	71418

La información que nos dan las dos variables anteriores refleja que estamos considerando solo autobuses diurnos e ignorando los nocturnos. Por tanto, esta variable no va a inferir en el tiempo de estimación ya que el hecho de que su horario comience a las 6 de la mañana o a las 7 de la mañana no va a depender de que tarde más o menos a lo largo del día. Por tanto consideramos que se pueden borrar también.

In [54]:

Copied!

sample_data = sample_data.drop('StartTime','StopTime')
sample_data = sample_data.drop('StartTime','StopTime')

Variable MinimunFrequency

Valores nulos

In [55]:

Copied!

sample_data.group_by(pl.col('MinimunFrequency')).count().head().collect()
sample_data.group_by(pl.col('MinimunFrequency')).count().head().collect()

Out[55]:

shape: (5, 2)

MinimunFrequency	count
i64	u32
null	533127
7	116427
3	47464
6	71872
12	40808

Correlación

In [56]:

Copied!

sample_data.select(pl.corr('MinimunFrequency','estimateArrive')).head().collect()
sample_data.select(pl.corr('MinimunFrequency','estimateArrive')).head().collect()

Out[56]:

shape: (1, 1)

MinimunFrequency
f64
0.302355

Variable MaximumFrequency

Valores nulos

In [57]:

Copied!

sample_data.group_by(pl.col('MaximumFrequency')).count().sort(pl.col('MaximumFrequency'),descending=True).collect()
sample_data.group_by(pl.col('MaximumFrequency')).count().sort(pl.col('MaximumFrequency'),descending=True).collect()

Out[57]:

shape: (11, 2)

MaximumFrequency	count
i64	u32
null	533127
30	46138
29	13514
26	54669
24	102448
…	…
21	23933
20	37704
17	49557
15	61758
12	47464

Correlación

In [58]:

Copied!

sample_data.select(pl.corr('MaximumFrequency','estimateArrive')).head().collect()
sample_data.select(pl.corr('MaximumFrequency','estimateArrive')).head().collect()

Out[58]:

shape: (1, 1)

MaximumFrequency
f64
0.260828

Las variables MinimunFrequency y MaximumFrequency tienen muchos valores nulos pero si que están relacionadas con el ETA. Una opción es mantener solo una de ellas ya que es probable que aporten la misma información. De momento mantenemos las dos.

Variable isHead

Valores nulos

In [59]:

Copied!

sample_data.group_by(pl.col('isHead')).count().head().collect()
sample_data.group_by(pl.col('isHead')).count().head().collect()

Out[59]:

shape: (2, 2)

isHead	count
u8	u32
0	1068067
1	63072

Correlación

In [60]:

Copied!

sample_data.select(pl.corr('isHead','estimateArrive')).head().collect()
sample_data.select(pl.corr('isHead','estimateArrive')).head().collect()

Out[60]:

shape: (1, 1)

isHead
f64
0.018379

Tiempo medio de espera según si el autobús es cabecera o no

In [61]:

Copied!

sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect().plot.bar(x='isHead')
sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect().plot.bar(x='isHead')

Out[61]:

In [62]:

Copied!

sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect()
sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect()

Out[62]:

shape: (2, 2)

isHead	estimateArrive
u8	f64
1	673.029807
0	636.534496

Mantenemos esta variable ya que está bastante relacionada con el ETA

Variable dayType

Valores nulos

In [63]:

Copied!

sample_data.group_by(pl.col('dayType')).count().head().collect()
sample_data.group_by(pl.col('dayType')).count().head().collect()

Out[63]:

shape: (2, 2)

dayType	count
str	u32
"LA"	598012
null	533127

Vemos de que tipo son los días nulos utilizando la fecha

In [64]:

Copied!





def get_type_day(date):
    
    day = date.strftime("%A")
    
    if day in ['Monday','Tuesday','Wednesday','Thursday','Friday']:
        
        type = 'LA'
    elif day == 'Saturday':
        type = 'SA'
    else:
        type = 'FE'
        
    return type
def get_type_day(date):
    
    day = date.strftime("%A")
    
    if day in ['Monday','Tuesday','Wednesday','Thursday','Friday']:
        
        type = 'LA'
    elif day == 'Saturday':
        type = 'SA'
    else:
        type = 'FE'
        
    return type

In [65]:

Copied!

sample_data = sample_data.with_columns(pl.when(pl.col('dayType').is_null()).then(pl.col('date').apply(get_type_day)).otherwise(pl.col('dayType')).alias('dayType'))
sample_data = sample_data.with_columns(pl.when(pl.col('dayType').is_null()).then(pl.col('date').apply(get_type_day)).otherwise(pl.col('dayType')).alias('dayType'))

In [66]:

Copied!

sample_data.group_by(pl.col('dayType')).count().head().collect()
sample_data.group_by(pl.col('dayType')).count().head().collect()

Out[66]:

shape: (1, 2)

dayType	count
str	u32
"LA"	1131139

Variable strike

Valores nulos

In [67]:

Copied!

sample_data.group_by(pl.col('strike')).count().head().collect()
sample_data.group_by(pl.col('strike')).count().head().collect()

Out[67]:

shape: (2, 2)

strike	count
str	u32
null	533127
"N"	598012

In [68]:

Copied!

sample_data.group_by(pl.col('strike')).mean().select(pl.col('strike'),pl.col('estimateArrive')).head().collect()
sample_data.group_by(pl.col('strike')).mean().select(pl.col('strike'),pl.col('estimateArrive')).head().collect()

Out[68]:

shape: (2, 2)

strike	estimateArrive
str	f64
null	687.680016
"N"	594.787466

La variable Strike toma solo 'N' o nulo, por lo que para ningún día se tiene constancia de que hubo huelga. Por tanto para estos datos esta variable no va a aportar información.

In [69]:

Copied!

sample_data = sample_data.drop('strike')
sample_data.head().collect()
sample_data = sample_data.drop('strike')
sample_data.head().collect()

Out[69]:

shape: (5, 16)

PK	date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	MinimunFrequency	MaximumFrequency	isHead	dayType	estimateArrive
str	date	datetime[μs]	i64	str	i64	f64	f64	i64	i64	str	i64	i64	u8	str	i64
"2024-03-13 12:…	2024-03-13	2024-03-13 12:14:06.145791	5552	"42"	5018	-3.689362	40.467008	0	880	"BARRIO PEÑAGRA…	7	24	0	"LA"	232
"2024-03-13 12:…	2024-03-13	2024-03-13 12:58:02.776095	4757	"19"	89	-3.694772	40.392075	0	1550	"PLAZA CATALUÑA…	null	null	0	"LA"	434
"2024-03-13 10:…	2024-03-13	2024-03-13 10:00:02.709359	4737	"49"	1542	-3.689145	40.467685	0	2618	"PITIS"	4	15	0	"LA"	707
"2024-03-13 15:…	2024-03-13	2024-03-13 15:07:05.141116	2132	"107"	1841	-3.6667	40.469799	0	1855	"HORTALEZA"	11	22	0	"LA"	363
"2024-03-13 15:…	2024-03-13	2024-03-13 15:08:07.085498	4725	"49"	5636	-3.700473	40.468273	0	137	"PITIS"	4	15	0	"LA"	39

Variable estimateArrive

Valores nulos

In [70]:

Copied!

sample_data.group_by(pl.col('estimateArrive')).count().head().collect()
sample_data.group_by(pl.col('estimateArrive')).count().head().collect()

Out[70]:

shape: (5, 2)

estimateArrive	count
i64	u32
1727	78
828	750
2236	13
167	959
423	947

Resumen

In [71]:

Copied!

sample_data.select(pl.col('estimateArrive')).describe()
sample_data.select(pl.col('estimateArrive')).describe()

Out[71]:

shape: (9, 2)

statistic	estimateArrive
str	f64
"count"	1.131139e6
"null_count"	0.0
"mean"	638.569465
"std"	455.633197
"min"	0.0
"25%"	274.0
"50%"	566.0
"75%"	922.0
"max"	5389.0

Las variables MinimunFrequency y MaximumFrequency están muy relacionadas entre ellas. Se podría dejar tan solo MinimumFrequency ya que aporta mas informacion al ETA que la otra

In [72]:

Copied!

sample_data = sample_data.drop('MaximumFrequency')
sample_data.head().collect()
sample_data = sample_data.drop('MaximumFrequency')
sample_data.head().collect()

Out[72]:

shape: (5, 15)

PK	date	datetime	bus	line	stop	positionBusLon	positionBusLat	positionTypeBus	DistanceBus	destination	MinimunFrequency	isHead	dayType	estimateArrive
str	date	datetime[μs]	i64	str	i64	f64	f64	i64	i64	str	i64	u8	str	i64
"2024-03-13 09:…	2024-03-13	2024-03-13 09:51:55.581672	4860	"49"	1550	-3.71357	40.483189	0	642	"PITIS"	4	0	"LA"	230
"2024-03-13 16:…	2024-03-13	2024-03-13 16:53:04.751456	2468	"134"	2864	-3.708434	40.498855	0	575	"MONTECARMELO"	9	0	"LA"	209
"2024-03-13 16:…	2024-03-13	2024-03-13 16:14:04.487383	2310	"178"	1760	-3.685786	40.47828	0	3282	"MONTECARMELO"	6	0	"LA"	415
"2024-03-13 16:…	2024-03-13	2024-03-13 16:56:06.220714	9120	"177"	5800	-3.688153	40.467592	0	910	"MARQUES DE VIA…	9	0	"LA"	607
"2024-03-13 08:…	2024-03-13	2024-03-13 08:29:03.023408	5637	"147"	29	-3.690032	40.455447	0	972	"BARRIO DEL PIL…	null	0	"LA"	170

Conclusión¶

Para cada .csv de un día concreto vamos a crear otro .csv que tan solo tenga las columnas:

PK
predict_arrival_date
reliable_arrival_date

Por otro lado vamos a crear otro .csv que tenga las columnas:

PK
date
datetime
bus
line
stop
positionBusLon
positionBusLat
DistanceBus
destination
MinimunFrequency
isHead
dayType
estimateArrive

El dataset de entrenamiento será la concatenación de todos los días y el join de ambos mediante la PK

EDA: EMT dataset 🚦

INESDATA-MOV

Análisis EDA¶

Dataset¶

Variables numéricas¶

Variables categóricas¶

Preprocesamiento previo¶

Creamos la PK¶

Eliminamos los valores incorrectos estimateArrive¶

Matriz de confusión¶

Estudio por variables¶

Conclusión¶

Eliminamos los valores incorrectos `estimateArrive`¶