
So one week of Bootcamp is gone, so fast. Now it is time for personal (you choose the topic) and individual (you do it alone) projects. Project #2 is all about scraping and Linear Regression.
For this project I must:
- Gather data using data scraping
- Analyze data using Linear Regression; this includes feature selection, categorical variables, time series and regularization
- Visualize each step where possible to get a better view of the results and make better decisions.
As a topic I decided to predict coffee production in Colombia, for several reasons: I wanted a continuous variable to get the most out of Linear Regression, I wanted to analyze and scrape all sorts of data (weather, economic, etc.), and personally, I wanted a topic that would tell a little bit about my country, since many of my classmates had a not-very-positive impression of it.
Background:
Colombia is the second-largest coffee-producing country in the world. Only 3% of the country produces 10% of the coffee consumed all over the globe. The purpose of this project is to analyze and determine which factors influence high coffee production, in order to ensure production for future months. For that we will analyze several types of information:
- Weather data: Weather conditions, and particularly the month of harvest, can influence the growth of the crops
- Competitor countries: Brazil is the biggest coffee producer, Vietnam has a similar production volume, and other countries might also affect production
- Economic factors: Most of the coffee produced in the country is exported. Can export volumes (demand) influence production? Also, since Colombia doesn’t use the usual trading currency (the dollar), the exchange rate between the dollar and the Colombian peso should be analyzed as well
Extract Data From Sources
Gathering all this information was actually the main challenge for this project:
- Weather data was scraped using Scrapy from Weather Underground for the three most important coffee-producing areas in Colombia
- Coffee production data from Colombia, as well as export volumes and the dollar exchange rate, were downloaded in CSV format from the Colombian Coffee Growers Organization webpage
- Economic and coffee production information from other countries was gathered from the International Coffee Organization webpage
I’ll write a future post on scraping with Scrapy, which I found very challenging, but very cool and useful in the end.
General EDA analysis for all variables
First, a general exploratory data analysis (EDA) was made to find possible relations between variables and to get a clearer view of the features that were going to be used in the final model.
We can see that the strongest relations between variables are found in the economic data (export volumes and dollar rate), which makes sense, and in coffee prices across different countries, which will be analyzed further. Weather variables, however, seem to be more independent of each other in general.
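A minimal sketch of this kind of correlation check, assuming pandas and a table with one row per month. The column names and the randomly generated numbers are placeholders for illustration only, not the data used in the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_months = 96  # eight years, one row per month

# Placeholder data: exports are built to move with production,
# weather columns are generated independently.
exports = rng.normal(1000, 100, n_months)
df = pd.DataFrame({
    "production": exports + rng.normal(0, 50, n_months),
    "exports": exports,
    "temperature": rng.normal(22, 2, n_months),
    "precipitation": rng.normal(150, 40, n_months),
})

# Pairwise Pearson correlations between all columns
corr = df.corr()

# Variables ranked by the strength of their relation with production
strongest = (
    corr["production"].drop("production").abs().sort_values(ascending=False)
)
print(strongest)
```

The same `corr` table can be passed to a heatmap plot to get a visual overview of which variables move together.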
Creating the Model
First Model
First of all, I split the data into a training set (70%) and a test set (30%). I used eight years of data, one row per month = 96 rows. I didn’t use a validation set in this case, due to the low number of rows I was working with.
The very first model I ran was an Ordinary Least Squares (OLS) linear regression using only the weather variables scraped from the Weather Underground webpage: temperature, wind speed, precipitation, humidity and dew point (all monthly averages). I also included an intercept in this model to give it more flexibility.
Results: R2(test) = 0.14, p-values between 0.039 and 0.745.
This means that some variables are relevant, but there’s some information missing. There’s still work to do.
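The split and the OLS fit can be sketched roughly as follows, using only NumPy. The feature values are random placeholders, and the random shuffle before splitting is an assumption (the post does not say how the rows were assigned to each set). P-values are not computed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 96  # eight years of monthly rows

# Placeholder stand-ins for the five monthly weather averages
X = rng.normal(size=(n_rows, 5))
true_coefs = np.array([2.0, 0.0, -1.0, 0.5, 0.0])  # made up for the demo
y = 10.0 + X @ true_coefs + rng.normal(scale=1.0, size=n_rows)

# 70% / 30% split on shuffled row indices (assumed random split)
idx = rng.permutation(n_rows)
n_train = int(0.7 * n_rows)
train_idx, test_idx = idx[:n_train], idx[n_train:]


def add_intercept(features):
    """Prepend a column of ones so the model learns an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares fit on the training rows only
beta, *_ = np.linalg.lstsq(
    add_intercept(X[train_idx]), y[train_idx], rcond=None
)

# R^2 on the held-out test rows
y_pred = add_intercept(X[test_idx]) @ beta
ss_res = np.sum((y[test_idx] - y_pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2_test = 1.0 - ss_res / ss_tot
print(f"R2(test) = {r2_test:.2f}")
```

Measuring R2 on the test rows rather than the training rows is what makes the scores in this post comparable from one model to the next.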
Categorical Variables
After weather, I decided to look at the harvest month. It is said that coffee only grows between the months of May and October, so this might be a good variable to add to my model. Although harvest month is a variable that ranges from 1 to 12, it’s not an ordinal or numerical variable: there is no reason to say that February (value=2) is double January (value=1), or that they are both less than September (value=9) or December (value=12). All months are equally important; the month number does not represent an amount and it doesn’t give any order (in this part of the analysis, that is; we’ll get to time series later). Harvest month is what we call a categorical variable, and instead of being treated as one column with values 1-12, we will transform it into 11 columns with values 0 or 1. Each row will have a 1 in the column of the month it belongs to and a 0 in the other ten columns. This process is called One Hot Encoding, and the reason we generate 11 columns and not 12 is to avoid redundancy between these new variables: if all columns from the 1st to the 11th (January to November) have the value 0, the row belongs to December. There is no need to add a twelfth column, because it would be fully determined by the other eleven.
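The encoding described above can be sketched with pandas as follows. The small `months` series is a made-up example, and dropping the December column by name mirrors the description in the text (the project's actual code may have done it differently).

```python
import pandas as pd

# Made-up sample: the month number of five rows
months = pd.Series([1, 2, 5, 9, 12], name="month")

# Declare all 12 months as categories so every month gets a column,
# even if it does not appear in the sample
month_cat = months.astype(pd.CategoricalDtype(categories=list(range(1, 13))))

# One 0/1 column per month: month_1 ... month_12
dummies = pd.get_dummies(month_cat, prefix="month", dtype=int)

# Drop December: a row with zeros in all 11 remaining columns is December
dummies = dummies.drop(columns="month_12")
print(dummies)
```

Each non-December row ends up with exactly one 1, and the December row is all zeros.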
Results: After including the 11 new features: R2(test) = 0.64, p-values between 0.002 and 0.759.
The Model is improving!
Competing Country and Economical Analysis
As you can see in the picture above, there seems to be a strong negative relation between the price of coffee, in Colombia, Brazil or other countries, and the production volume. Does this make sense? Do lower prices lead to higher production volume? Actually, it is the other way around: higher volumes make prices go lower, and low production makes coffee more expensive. If you think about it, this is simple supply and demand theory.
Results: Production volume doesn’t depend on coffee prices. On the contrary, prices depend on production. Price in Colombia or in other countries shouldn’t be included in the analysis.
Time Series Analysis
We’ve already seen that harvest month is an important feature for determining production volume, but what about other months? Assuming that the process of getting coffee ready for export after harvest (peeling, drying and in some cases grinding; in Colombia all of these are done manually to get a better coffee!) takes around a month, let’s see whether production is affected by the production of the month before. That is, is December’s production affected by November’s, November’s by October’s, and so on? This is a time series approach, and in simple words it is done by adding the dependent variable from one month as a feature for predicting the next month.
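A minimal sketch of building that previous-month feature with pandas. The production numbers and the column name `prev_month_production` are made up for illustration.

```python
import pandas as pd

# Made-up monthly production values (thousands of bags), in time order
df = pd.DataFrame(
    {"production": [900, 1100, 1250, 1000, 950, 1200]},
    index=["2017-10", "2017-11", "2017-12", "2018-01", "2018-02", "2018-03"],
)

# shift(1) moves every value down one row, so each month
# sees the production of the month before it
df["prev_month_production"] = df["production"].shift(1)

# The first month has no previous value, so it is dropped
df = df.dropna()
print(df)
```

The rows must be sorted by date before shifting, otherwise the "previous month" column will not line up with the right month.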
Results: R2(test) improved to 0.67. Below you can see an image of the predictions using only previous-month production as a feature. You can see that the predictions for the last months (purple stars, labeled Jan through March) are very close to the real values (green stars). For next month, this model predicts a coffee production of 1,176 thousand bags; we’ll just have to wait and see the results. The model improved, so I’ll include previous-month production as a feature in the model.
Update (May 2018): The real value for April 2018 was 1,037 thousand bags of coffee. Only about 140 thousand bags off from the model’s prediction!
Now, how to find the best features? (Regularization)
Regularization is a process that helps find the most important features in our model. How can we do this? Mathematically, this is done by adding a new penalty term to the model’s cost function. There are multiple ways to do this, but all of them try to minimize the number or the size of the coefficients included in the model. If we just added all features, the model would overfit, that is: it would fit the training data very well but do no good on test data. (The opposite would be underfitting: the model would be too general and wouldn’t predict well in either case.) The methods I used for these models were:
- Lasso (L1): Takes the penalty as the sum of the magnitudes (absolute values) of all the coefficients. It will try to shrink the coefficients of non-important features to zero
- Ridge (L2): Takes the penalty as the sum of the squares of all the coefficients. It will try to make the coefficients of non-important features very small
- Elastic-Net: Uses a mixture of both of these penalties.
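As a rough illustration of how the three penalties behave, here is a sketch using scikit-learn on synthetic data in which only the first of five features matters. The data and the `alpha` and `l1_ratio` values are arbitrary choices for the demo, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 96 rows, 5 features, only feature 0 drives the target.
# Features are already on the same scale; real features should be
# standardized before applying these penalties.
X = rng.normal(size=(96, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=96)

models = {
    "OLS": LinearRegression(),                        # no penalty
    "Lasso": Lasso(alpha=0.5),                        # L1 penalty
    "Ridge": Ridge(alpha=10.0),                       # L2 penalty
    "ElasticNet": ElasticNet(alpha=0.5, l1_ratio=0.5),  # mix of L1 and L2
}

# Fit every model on the same data and keep its coefficients
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, c in coefs.items():
    print(f"{name:>10}: {np.round(np.abs(c), 3)}")
```

In this toy setup Lasso sets the coefficients of the four irrelevant features to exactly zero, while Ridge only shrinks them, which is why comparing absolute coefficient values across methods is a handy way to shortlist features.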
Results: The graph above (I really love this graph!) shows the absolute value of the coefficient of each feature according to each of the regularization methods used (OLS is the original model without any regularization at all). The features marked in red were the ones I chose (my own, personal, subjective decision) for the final model. R2(test) = 0.783!!!
Conclusions:
Contrary to what was believed at the beginning of this analysis, coffee production doesn’t seem to be affected by weather conditions. Coffee will grow at any temperature and humidity within the observed range. Protection against rain, however, MUST be maintained: precipitation seems to be the only significant weather factor.
Economic factors from international competitors such as Brazil, Vietnam, Peru and Mexico don’t seem to affect it either. The production of these countries, along with Colombia’s, seems to be influenced in parallel by common external factors not included in this analysis.
Month of harvest seems to be the most important feature in coffee production. The periods from May to August and from October to December seem to have the highest impact.
If you’d like to look at the full analysis, the data or the code, check out my GitHub.