Data Science

Time Series and Non-time Series Data

Time series data is auto-correlated. In other words, the next value in a time series is affected by the previous values (its history). Hence, such data is usually indexed in chronological order, most frequently as equally spaced points in time, each marked with a date-time stamp. Common examples are historical weather data, stock prices, a company's customer sales records over time, financial data, daily, weekly or monthly demand volumes from customers, wireless signals over time, and so on. If we receive data with timestamps and the value to be predicted is time dependent, we treat it as time series data.
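
To make auto-correlation concrete, here is a minimal sketch (the demand series is invented) that computes the lag-1 autocorrelation, i.e. how strongly each value is correlated with the value immediately before it:

```python
# A minimal sketch of what "auto-correlated" means: the lag-1
# autocorrelation measures how strongly each value in a series
# depends on the one before it. The demand numbers are invented.

def lag1_autocorrelation(series):
    """Pearson-style correlation between the series and itself shifted by one step."""
    n = len(series)
    mean = sum(series) / n
    # Covariance between consecutive values, normalised by the variance.
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

demand = [200, 210, 225, 240, 250, 265, 280, 290, 310, 330]
print(round(lag1_autocorrelation(demand), 2))  # → 0.68
```

A value close to 1 means the series depends strongly on its own history, which is exactly the property that distinguishes time series data from independent rows.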

Let us take an example of time series data. As a supplier of many different types of products, Amazon needs to know two months in advance what the demand for a product (let's say product A) will be in a certain country (let's say Germany) during a certain period (let's say the month of February). In such a case we need historical data, and the data provided by Amazon might look as follows (this is just a randomly created example), where the date, day of week and demand are the important features. Country, region and product type also matter, but we have kept them constant in this example.


Country  Region  Date        Day of Week  Product type  Demand
DE       Europe  2018-12-03  Monday       Product A     200000
DE       Europe  2018-12-04  Tuesday      Product A     300000
DE       Europe  2018-12-05  Wednesday    Product A     350000
DE       Europe  2018-12-06  Thursday     Product A     180000
DE       Europe  2018-12-07  Friday       Product A     185000
DE       Europe  2018-12-08  Saturday     Product A     100000
DE       Europe  2018-12-09  Sunday       Product A     20000


When dealing with time series data, there are two terms one needs to be very clear about: trend and seasonality. Trend refers to how the value grows over time on average. In most cases a linear trend is used, simply to indicate whether the series is increasing (positive slope), decreasing (negative slope) or constant (zero slope). In mathematical terms, the trend can be obtained by determining the best-fitting line through the given data points.
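
As a sketch, the slope of that best-fitting line can be computed directly with the standard least-squares formula; the demand values below are invented:

```python
def trend_slope(values):
    """Slope of the least-squares line through the points (t, value), t = 0, 1, ..."""
    n = len(values)
    t_mean = (n - 1) / 2              # mean of the time indices 0..n-1
    v_mean = sum(values) / n
    # Standard least-squares slope: covariance(t, value) / variance(t).
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

# A positive slope indicates an increasing trend, negative a decreasing one.
print(trend_slope([10, 12, 15, 13, 18, 20]))
print(trend_slope([5, 4, 3, 2, 1]))   # → -1.0
```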

In real life, data points tend to show similar patterns repeated over time. Such a pattern can repeat every hour, day, week, month, three months, six months, year and so on. This phenomenon is referred to as 'seasonality'. Hence, depending on how the pattern repeats, there can be daily, weekly, monthly, quarterly or yearly seasonality.

The third behaviour that can be observed in time series data is the residual, which is the effect of random events happening over time, for example holidays, increases in fuel price, strikes in certain regions, government decisions, company policies and so on. Some of these effects, such as fixed-date holidays, repeat on exactly the same date every year, while others fall on different dates in different years. If your business is strongly impacted by holidays, this effect needs to be handled properly before doing any forecasting. The best way to do time series forecasting is to understand the business implications and know the external factors affecting your business.
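
The three components above can be separated with a classical additive decomposition (value = trend + seasonality + residual). In practice one would typically use a library routine such as seasonal_decompose from statsmodels; the sketch below, on invented data with a known period, shows the idea:

```python
# A hedged sketch of classical additive decomposition for a series
# with a known period: value = trend + seasonality + residual.

def decompose(values, period):
    n = len(values)
    half = period // 2
    # Trend: centred moving average over one full period (odd period assumed).
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(values[t - half:t + half + 1]) / period
    # Seasonality: average detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(values[t] - trend[t])
    seasonal = [sum(b) / len(b) if b else 0.0 for b in buckets]
    # Residual: whatever is left after removing trend and seasonality.
    resid = [values[t] - trend[t] - seasonal[t % period]
             for t in range(n) if trend[t] is not None]
    return trend, seasonal, resid

# Invented data: linear trend plus a repeating period-3 pattern [3, 0, -3].
vals = [t + [3, 0, -3][t % 3] for t in range(12)]
trend, seasonal, resid = decompose(vals, 3)
print(seasonal)  # → [3.0, 0.0, -3.0]
```

On this clean toy series the residuals are zero; on real demand data they would carry exactly the random-event effects described above.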

Other datasets, which do not show time-dependent behaviour or where auto-correlation is not considered, are referred to here as 'non-time series data'. Such datasets have many independent features and a dependent variable, which can be regressed or classified depending on the purpose. Many real-life datasets are non-time series data; common examples are the iris dataset, the Titanic dataset, etc.

Some of the data points from the Titanic dataset look as follows:


PassengerID  Survived  Pclass  Name                          Sex     Age  SibSp  Parch  Ticket           Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris       male    22   1      0      A/5 21171        7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley    female  38   1      0      PC 17599         71.2835  C85    C
3            1         3       Heikkinen, Miss. Laina        female  26   0      0      STON/O2.3101282  7.9250   NaN    S
4            1         3       Futrelle, Mrs. Jacques Heath  female  35   1      0      113803           53.100   C123   S
5            1         3       Allen, Mr. William Henry      male    35   0      0      373450           8.0500   NaN    S


From the dataset, we can clearly see that the data in each row are independent of one another. The survival of a given passenger does not depend on the other passengers but only on that passenger's own features. Whether passenger 5 survived has no bearing on whether passenger 6 survives. Some features, such as age, sex, fare and port of embarkation, may be more relevant than others, such as name or PassengerID. The aim of any data science approach is to clearly identify all the important features before applying any machine learning algorithms.
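
A first step in judging feature relevance can be as simple as grouping rows by a candidate feature and comparing outcome rates. The sketch below does this for the sex feature, using the five sample rows shown above:

```python
# Rows are (survived, sex, age) taken from the sample Titanic table above.
rows = [
    (0, "male", 22), (1, "female", 38), (1, "female", 26),
    (1, "female", 35), (1, "male", 35),
]

def survival_rate(rows, sex):
    """Fraction of passengers of the given sex who survived."""
    matched = [survived for survived, s, _ in rows if s == sex]
    return sum(matched) / len(matched)

print(survival_rate(rows, "female"))  # → 1.0 in this tiny sample
print(survival_rate(rows, "male"))    # → 0.5
```

Five rows prove nothing on their own, of course, but the same grouping idea applied to the full dataset is how one starts to separate relevant features from irrelevant ones.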

Real-time Processing and Batch Processing

Real-time data processing, sometimes also called event processing, requires continuous data input, immediate processing and immediate results. In other words, the response time should be very quick, in real time or near real time, and the processing jobs cannot be postponed; in fact, only timely results are useful at all. For example, IoT data such as traffic sensor readings, temperature sensors, health sensors, transaction logs, fraud detection and network monitoring need to be processed as soon as the data stream is received. Most of the time we receive a continuous stream of data, which would require a lot of space and time to process if not handled instantly; such data also calls for real-time processing methods.
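
The core pattern of real-time processing, updating results immediately per event instead of storing the stream for later, can be sketched with a plain Python generator (the sensor readings are invented; production systems would use dedicated stream-processing tools):

```python
# A hedged sketch of stream processing: values are consumed one at a
# time and a running statistic is updated incrementally, so a result
# is available as soon as each event arrives.

def running_mean(stream):
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count   # result emitted immediately per event

sensor_readings = iter([20.0, 22.0, 21.0, 23.0])  # stands in for a live feed
for mean in running_mean(sensor_readings):
    print(mean)  # → 20.0, 21.0, 21.0, 21.5
```

Nothing is ever stored or reprocessed; batch processing, by contrast, would collect all readings first and compute over the stored set later.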

Batch data processing, however, does not need to be quick or near real time. The data can be stored and processed separately over time. For big data, Hadoop is commonly used for batch processing, and machine learning and deep learning algorithms are easy to apply in a batch setting.

Real-time data processing requires different techniques and tools. Once we know the real-time data sources, such as sensor data and social media data, the data needs to be ingested and transported. Some of the tools used for real-time (stream) data ingestion are Apache Kafka, Apache Flume and Amazon Kinesis. The data is then processed in real time using common open-source tools such as Apache Spark, Apache Storm and Apache Flink; Apache Spark is widely used in the data science community. The processed data is then stored in systems such as Cassandra, MongoDB, Amazon Athena, Amazon Redshift or Elasticsearch, and can easily be queried using streaming SQL.

Hybrid real-time and batch processing architectures, such as the Lambda architecture and the Kappa architecture, are also quite common.