The Battle of Neighborhoods

Exploring ‘The Big Durian’ for a Coffee Shop Business Opportunity

Ginanjar Saputra
7 min read · Feb 15, 2021

This article is written as a part of the final project submission for the Applied Data Science Capstone course on Coursera’s IBM Data Science Professional Certificate.

The project can be accessed on this GitHub repository.

Monumen Nasional and the Jakarta skyline, by Gunawan Kartapranata

Introduction

Jakarta is the special capital region of Indonesia, an archipelago in Southeast Asia. Located on the northwest coast of Java, it is home to a population of 10.5 million and is the second-largest urban agglomeration in the world. A melting pot of many cultures, Jakarta is the center of Indonesia’s economic activity and has attracted people from across the archipelago in search of opportunities and a potentially better standard of living.

Business opportunities abound in Jakarta, and the food-and-beverage (F&B) sector has long been an attractive target for investors. It has recorded the largest investment realization among secondary sectors in Indonesia over the last five years, totaling IDR 293 trillion. Coffee shops, in particular, have been a booming F&B business in Indonesia, as reflected in the significant rise in the number of outlets and in domestic coffee consumption in recent years. The market value of coffee shops is also estimated to reach over IDR 4 trillion per year. With such promising prospects, various stakeholders (entrepreneurs, investors) may be interested in exploring coffee shop business opportunities in Jakarta.

Problem Statement

This data science project is thus carried out to help stakeholders answer the following question:

“Which of the Jakarta regions are strategic for opening a coffee shop business?”

Apart from business stakeholders, the project may also be of interest to fellow coffee enthusiasts.

Data

  1. The names of administrative regions (city, district, subdistrict) in Jakarta and corresponding postal codes. The data was scraped from a directory on this website.
  2. Geographical coordinates of Jakarta and its subdistricts, obtained using the Nominatim geocoder from the GeoPy library.
  3. Information about venues in Jakarta regions: the venue names, categories, latitudes, and longitudes. These were obtained using the Foursquare API.

Methodology

The following Python libraries and dependencies were used: Pandas, NumPy, Requests, BeautifulSoup, time, string, GeoPy (Nominatim geocoder), JSON, Matplotlib, Folium, and scikit-learn.

After sending a GET request to the website of interest, the response (the HTML of the webpage) was parsed using BeautifulSoup and the relevant data (names of cities, districts, subdistricts, and postal codes) were scraped. We obtained 267 entries as a result. Geographical coordinates were then retrieved via Nominatim, using postal codes as input. The resulting dataframe (tabular data structure) is as follows:

Regions in Jakarta, corresponding postal codes, and geographical coordinates (first 10 rows displayed).
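The scraping step can be sketched roughly as below. This is only an illustration: the HTML snippet, its four-column table layout, the sample rows, and the helper name parse_regions are all assumptions, and the real directory page is likely structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the text returned by requests.get(url).text.
# The rows are illustrative samples, not the scraped dataset.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>District</th><th>Subdistrict</th><th>Postal Code</th></tr>
  <tr><td>Jakarta Pusat</td><td>Gambir</td><td>Gambir</td><td>10110</td></tr>
  <tr><td>Jakarta Pusat</td><td>Gambir</td><td>Cideng</td><td>10150</td></tr>
</table>
"""


def parse_regions(html):
    """Turn the data rows of the first <table> into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 4:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    columns = ["City", "District", "Subdistrict", "PostalCode"]
    return pd.DataFrame(rows, columns=columns)


regions = parse_regions(SAMPLE_HTML)
print(regions)
```

Postal codes are kept as strings here so that they can be passed to a geocoder unchanged.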

Jakarta consists of 5 cities in mainland Jakarta and 1 regency off the coast of Jakarta. Each of these cities/regency is further subdivided into districts (kecamatan) and then subdistricts (kelurahan). In total, there are 44 districts and 267 subdistricts.

Subdivisions of regions in Jakarta.

The next step was to make API calls to Foursquare to get a list of venues within a defined radius (in this case, 1 km) of each subdistrict's coordinates. The results were limited to a maximum of 100 venues per subdistrict. Besides the name of each venue, details such as its category, latitude, and longitude were also obtained. A total of 14739 entries were returned.

Venues surrounding Jakarta subdistricts, collected from Foursquare (first 10 rows displayed).
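A minimal sketch of the venue collection step is given below. The article does not name the endpoint, so the v2 venues/explore URL, the response layout (response → groups → items → venue), and the helper names build_params and extract_venues are assumptions. The payload is hand-written so that the example runs without credentials or a network connection.

```python
FOURSQUARE_URL = "https://api.foursquare.com/v2/venues/explore"  # assumed endpoint


def build_params(lat, lng, client_id, client_secret,
                 radius=1000, limit=100, version="20210101"):
    """Query parameters for one subdistrict (radius in meters)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def extract_venues(payload, subdistrict):
    """Flatten a payload into (subdistrict, name, category, lat, lng) rows."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append((
            subdistrict,
            venue["name"],
            categories[0].get("name"),
            venue["location"]["lat"],
            venue["location"]["lng"],
        ))
    return rows


# Hand-written payload in the assumed layout (made-up venue).
sample_payload = {
    "response": {
        "groups": [{
            "items": [{
                "venue": {
                    "name": "Kopi Contoh",
                    "location": {"lat": -6.17, "lng": 106.82},
                    "categories": [{"name": "Coffee Shop"}],
                }
            }]
        }]
    }
}

rows = extract_venues(sample_payload, "Gambir")
print(rows)
```

In a live run, the payload would come from something like requests.get(FOURSQUARE_URL, params=build_params(...)).json(), repeated once per subdistrict.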

Grouping the venues by category shows that the top 10 are dominated by food-and-beverage venues. Indonesian Restaurant and Coffee Shop are virtually tied as the most common venue categories in Jakarta.

Top 10 most common venue categories in Jakarta.

The subdistricts were then clustered based on the similarity of their surrounding venues, so that insights could be derived as to which regions and clusters have a high concentration of coffee shops.

Before clustering, the categorical variables were converted into numerical variables through one-hot encoding. The data were then grouped by subdistrict, and the mean frequency of occurrence of each venue category in a subdistrict was calculated.

One-hot encoding, showing mean frequency of venue occurrence in each subdistrict (first 10 rows displayed).
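The encoding and grouping steps can be illustrated with a small sketch. The toy venue table and the column names Subdistrict and Category are assumptions chosen for the example, not the project's actual data.

```python
import pandas as pd

# Toy venue table: one row per venue returned for a subdistrict.
venues = pd.DataFrame({
    "Subdistrict": ["A", "A", "A", "A", "B", "B"],
    "Category": [
        "Coffee Shop", "Coffee Shop", "Noodle House", "Indonesian Restaurant",
        "Coffee Shop", "Convenience Store",
    ],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Subdistrict", venues["Subdistrict"])

# The mean of the indicator columns is the share of each category
# among the venues of a subdistrict.
grouped = onehot.groupby("Subdistrict").mean().reset_index()
print(grouped)
```

Each row of grouped sums to 1, so subdistricts with many venues and subdistricts with few venues end up on the same scale before clustering.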

Sorting the frequency of venue occurrence, we were able to get the most common venues of every subdistrict.

Top 5 common venues of every subdistrict (first 10 rows displayed).
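One way to rank the categories per subdistrict is sketched below. The frequency values are made up, and top_venues is a hypothetical helper name.

```python
import pandas as pd

# Toy frequency table in the shape produced by the grouping step:
# rows are subdistricts, columns are venue categories.
grouped = pd.DataFrame(
    {
        "Coffee Shop": [0.5, 0.2],
        "Noodle House": [0.3, 0.1],
        "Convenience Store": [0.2, 0.7],
    },
    index=pd.Index(["A", "B"], name="Subdistrict"),
)


def top_venues(freq, n=5):
    """Return the n most frequent categories of every subdistrict."""
    n = min(n, freq.shape[1])  # cannot rank more categories than exist
    ranked = {
        subdistrict: row.sort_values(ascending=False).index[:n].tolist()
        for subdistrict, row in freq.iterrows()
    }
    columns = [f"Most Common Venue {i + 1}" for i in range(n)]
    return pd.DataFrame.from_dict(ranked, orient="index", columns=columns)


top = top_venues(grouped, n=2)
print(top)
```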

K-Means Clustering

K-Means clustering is a machine learning algorithm that creates homogeneous subgroups (clusters) from unlabeled data such that the data points in each cluster are as similar as possible to each other according to a similarity measure (e.g., Euclidean distance).

A value for k (number of clusters) needs to be defined before proceeding with the clustering. The “Elbow Method” was used, which calculates the sum of squared distances of data points to their closest centroid (cluster center) for different values of k. The optimal value of k is the one after which there is a plateau (no significant decrease in sum of squared distances).
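The elbow computation can be sketched with scikit-learn, whose KMeans estimator exposes the sum of squared distances to the closest centroid as inertia_. The data here are synthetic (three artificial blobs standing in for the subdistrict frequency matrix), and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the subdistrict-by-category frequency matrix.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(30, 4))
    for center in (0.0, 0.5, 1.0)
])

inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to closest centroid

for k, value in inertias.items():
    print(k, round(value, 3))
```

Plotting inertias against k (for example with Matplotlib) gives the elbow curve. On real venue data the curve may flatten gradually with no clear bend.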

However, because the curve showed no discernible “elbow”, another measure was used: the silhouette score. The silhouette score ranges from -1 to 1. A value near 1 means a cluster is dense and well separated from other clusters. A value near 0 indicates overlapping clusters, with data points lying close to the decision boundary between neighboring clusters. A negative score suggests that samples may have been assigned to the wrong clusters. Since the score peaked at k = 6, K-Means clustering proceeded with that value.
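Choosing k by silhouette score can be sketched as follows, again on synthetic data rather than the project's venue matrix. The blobs are constructed to have three groups, so the peak here will not match the k = 6 found for Jakarta.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the subdistrict-by-category frequency matrix.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(30, 4))
    for center in (0.0, 0.5, 1.0)
])

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all samples

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The value of k with the highest mean silhouette score is taken as the number of clusters.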

Each subdistrict was assigned a cluster label (0–5). These clusters were color-coded and visualized on a map of Jakarta to examine how they are distributed across the regions. The Folium library was used to generate the map.

Each subdistrict was assigned a cluster label (first 10 rows displayed).
Clusters of Jakarta subdistricts based on similarity of venues.

The clusters were analyzed separately to identify the venue category that characterizes each of them. For each cluster, the single most common venue category was singled out, along with the regions (cities) in which that cluster is highly concentrated.

Concentration of cluster members across the cities of Jakarta.
Most common venue, Cluster 0 (Red): Fast Food Restaurant.
Most common venue, Cluster 1 (Purple): Convenience Store.
Most common venue, Cluster 2 (Blue): Chinese Restaurant.
Most common venue, Cluster 3 (Cyan): Noodle House.
Most common venue, Cluster 4 (Light Green): Paper / Office Supplies Store.
Most common venue, Cluster 5 (Orange): Indonesian Restaurant.

The total number of coffee shops within each of the Jakarta cities and districts was calculated to examine the distribution of coffee shop businesses and to help identify strategic locations. This distribution was visualized using a choropleth map. A GeoJSON file containing city boundaries was obtained from this repository on GitHub.

Total number of coffee shops in Jakarta cities.
Ten districts having the highest number of coffee shops.
Ten districts having the lowest number of coffee shops.
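The counting step can be sketched with pandas as below. The table is a toy example with made-up rows, and the column names City and Category are assumptions. The same pattern applies at district level by grouping on a district column instead.

```python
import pandas as pd

# Toy venue table with made-up rows.
venues = pd.DataFrame({
    "City": [
        "Jakarta Selatan", "Jakarta Selatan", "Jakarta Selatan",
        "Jakarta Timur", "Jakarta Timur", "Jakarta Utara",
    ],
    "Category": [
        "Coffee Shop", "Coffee Shop", "Noodle House",
        "Coffee Shop", "Convenience Store", "Noodle House",
    ],
})

# Summing a boolean flag keeps cities with zero coffee shops in the result,
# which a plain filter followed by a count would drop.
per_city = (
    venues.assign(is_coffee=venues["Category"].eq("Coffee Shop"))
    .groupby("City")["is_coffee"]
    .sum()
    .sort_values(ascending=False)
)
print(per_city)
```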

Results and Discussion

Exploratory data analysis as well as machine learning and visualization techniques have provided us with some insights into the problem at hand.

A total of 14739 venues from all Jakarta regions (267 subdistricts) were returned at the time the API calls were made. On average, there are 55 venues within a kilometer of a subdistrict center, and the two most common categories overall are Indonesian Restaurant and Coffee Shop.

After deciding on an optimal k value of 6, the K-Means algorithm was run to cluster the subdistricts based on their most common surrounding venues. Each of the six clusters, labeled 0–5, is characterized by a dominant venue category, as follows:

Clustering results summary.

A considerable number of coffee shops can be found within Cluster 5 (41 shops out of 151 venues). In fact, Coffee Shop is the second-most common venue category in that cluster. The choropleth map of coffee shop locations across mainland Jakarta shows that Jakarta Selatan has a very high concentration of the business, with 426 shops, while each of the other cities has fewer than 200. The districts in Jakarta Selatan are therefore not considered viable options for opening a coffee shop, because the market there is already heavily saturated.

It is therefore recommended that stakeholders look into opportunities in Jakarta Timur (e.g., Cakung, Kramat Jati) and Jakarta Utara (e.g., Kelapa Gading), as these two cities have the lowest concentration of coffee shops, which would significantly reduce competition. If, however, moderate competition is not a concern, then districts in Jakarta Pusat (e.g., Cempaka Putih, Johar Baru) and Jakarta Barat (e.g., Kalideres, Cengkareng) are also recommended.

Conclusion

Stakeholders searching for opportunities to open a coffee shop in Jakarta may want to consider setting up their business where competition is not severe. All of Jakarta’s sub-regions were explored and then clustered based on the similarity of their surrounding venues using the K-Means clustering algorithm. The results show that districts in Jakarta Utara and Jakarta Timur are among the best candidates for a new coffee shop location.
