The Battle of Neighborhoods: Coursera Capstone Project

Vishal Chauhan
Oct 6, 2019

Opening a new authentic Indian restaurant in Queens, NY

1. Discussion and Background of the Business Problem:

Indian Restaurants

Introduction Section

This final project explores the best locations for Indian restaurants in Queens, New York. New York City is a major metropolitan area with more than 8.4 million people (Quick Facts, 2018) living within city limits. It is the largest city in the United States and has a long history of international immigration, with people arriving from many parts of the world. According to the 2007 American Community Survey estimates, New York City is home to approximately 315,000 people from the Indian subcontinent.
A diverse culture brings diverse food. New York City has many restaurants, each belonging to a category such as Chinese, Indian, or French.

Target Audience

  • Business people who want to invest in or open a restaurant.
  • Freelancers who would love to run their own restaurant as a side business.
  • Anyone looking for the best location to open a restaurant.
  • Budding data scientists who want to apply some of the most commonly used exploratory data analysis techniques to obtain the necessary data, analyze it, and finally tell a story with it.

Data Section

For this project we need the following data:
1. New York City data that contains the boroughs and neighborhoods along with their latitudes and longitudes.

  • Data Source: https://cocl.us/new_york_dataset
  • Description: This data set contains the required information. We will use it to explore the various neighborhoods of New York City.

2. Indian restaurants in the neighborhoods of Queens, New York City.

  • Data Source: Foursquare API
  • Description: Using this API we will get all the venues in each Queens neighborhood. We can then filter these venues to keep only Indian restaurants.

Approach

  • Collect the New York City data from https://cocl.us/new_york_dataset.
  • Use the Foursquare API to get all venues for each neighborhood.
  • Filter the venues to keep only Indian restaurants.
  • Visualize the data and run some statistical analysis.
  • Analyze the neighborhoods using clustering (specifically K-Means):
    1. Find the best value of K.
    2. Visualize the neighborhoods by their number of Indian restaurants.
  • Compare the neighborhoods to find the best place to start a restaurant.
  • Draw inferences from these results and state the conclusions.

Problem Statement

  1. What is the best location for an Indian restaurant in Queens, New York City?
  2. In what Neighborhood should I open an Indian restaurant to have the best chance of being successful?

2. Data Preparation:

I will use New York City data for this project.

After some processing, we get a data frame with the coordinates of each neighborhood:

Dataframe from New York City data
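
As a rough sketch of this step, the snippet below flattens a GeoJSON-style file into one row per neighborhood and keeps only Queens. It assumes the dataset stores a `features` list whose entries carry `properties.borough`, `properties.name`, and `geometry.coordinates` as [longitude, latitude]; the function names and the sample values are hypothetical and only illustrate the shape of the data. It uses plain Python lists of dictionaries, which can be passed to `pandas.DataFrame` afterwards.

```python
def neighborhoods_to_rows(newyork_data):
    """Flatten the GeoJSON-style dataset into one dict per neighborhood."""
    rows = []
    for feature in newyork_data["features"]:
        props = feature["properties"]
        # GeoJSON points store coordinates as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"]
        rows.append({
            "Borough": props["borough"],
            "Neighborhood": props["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return rows


def filter_borough(rows, borough):
    """Keep only the rows that belong to the given borough."""
    return [row for row in rows if row["Borough"] == borough]


# Hand-made sample in the assumed layout (illustrative values, not the real file).
sample = {
    "features": [
        {"properties": {"borough": "Queens", "name": "Astoria"},
         "geometry": {"coordinates": [-73.92, 40.77]}},
        {"properties": {"borough": "Bronx", "name": "Wakefield"},
         "geometry": {"coordinates": [-73.85, 40.89]}},
    ]
}

queens = filter_borough(neighborhoods_to_rows(sample), "Queens")
print(queens)
```

With the real file, `sample` would be replaced by the result of `json.load` on the downloaded dataset.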

We will use the geopy library to get the coordinates of Queens, NY for later use:

Coordinates of Queens
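
A minimal sketch of the geocoding step, assuming geopy's `Nominatim` geocoder is used. The helper name `geocode_place` and the user-agent string are hypothetical. geopy is imported lazily, and the geocoding function can be swapped out, so the helper can be tried without a network connection.

```python
def geocode_place(query, geocode=None):
    """Return (latitude, longitude) for a place name, or None if nothing is found."""
    if geocode is None:
        # Imported here so the helper can be exercised without geopy installed.
        from geopy.geocoders import Nominatim
        # "queens_explorer" is an arbitrary, hypothetical user-agent string.
        geocode = Nominatim(user_agent="queens_explorer").geocode
    location = geocode(query)
    if location is None:
        return None
    return location.latitude, location.longitude


if __name__ == "__main__":
    # Sends a request to the Nominatim service, so it needs network access.
    print(geocode_place("Queens, NY"))
```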

Using Foursquare Location Data:

Foursquare data is very comprehensive and powers location data for Apple, Uber, and others. For this business problem I used the Foursquare API, as part of the assignment, to retrieve each venue's name and category along with its latitude and longitude. The call returns a JSON response, which we need to turn into a data frame. Here I requested up to 100 popular spots for each neighborhood within a radius of 500 meters. Below is the data frame obtained from the JSON returned by Foursquare:
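
The request and the JSON-to-table step could look roughly like the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[0].items[].venue` layout, which may not match the API version available today. The function names, the version date, and the sample payload are all hypothetical. The credentials are placeholders.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; check the current Foursquare documentation before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the request URL for up to `limit` venues within `radius` meters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # assumed API version date
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def venues_from_response(neighborhood, payload):
    """Turn one explore response into flat rows (assumed response layout)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),
        })
    return rows


# Hand-made payload in the assumed layout (not real Foursquare output).
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Sample Diner",
               "location": {"lat": 40.7, "lng": -73.8},
               "categories": [{"name": "Indian Restaurant"}]}},
]}]}}

url = build_explore_url(40.7, -73.8, "YOUR_ID", "YOUR_SECRET")
rows = venues_from_response("Sample Neighborhood", fake_payload)
```

The list of rows can be passed to `pandas.DataFrame` to get a table like the one described above.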

3. Exploratory Data Analysis:

There are 271 unique venue categories, and Indian Restaurant is one of them. We one-hot encode the venue category to get dummy variables, so that we can calculate the mean of each venue category grouped by neighborhood.

After this, we extract only the Neighborhood and Indian Restaurant columns for further analysis:

Mean of Indian restaurants grouped by neighborhood
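
To make the encoding step concrete: averaging a one-hot column per neighborhood gives the share of that neighborhood's venues that fall into the category. The sketch below computes that share directly in plain Python. With pandas the same result would normally come from `get_dummies` followed by a `groupby(...).mean()`. The function names, neighborhood names, and venues are made up for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Share of venues per category for each neighborhood.

    This equals one-hot encoding the category and averaging each dummy
    column within a neighborhood.
    """
    counts = defaultdict(Counter)
    for venue in venues:
        counts[venue["Neighborhood"]][venue["Category"]] += 1
    categories = sorted({venue["Category"] for venue in venues})
    table = {}
    for hood, counter in counts.items():
        total = sum(counter.values())
        table[hood] = {cat: counter[cat] / total for cat in categories}
    return table


def extract_column(table, category):
    """Keep a single category column, keyed by neighborhood."""
    return {hood: row.get(category, 0.0) for hood, row in table.items()}


# Hypothetical venues, only to show the calculation.
venues = [
    {"Neighborhood": "A", "Category": "Indian Restaurant"},
    {"Neighborhood": "A", "Category": "Pizza Place"},
    {"Neighborhood": "A", "Category": "Indian Restaurant"},
    {"Neighborhood": "A", "Category": "Park"},
    {"Neighborhood": "B", "Category": "Pizza Place"},
    {"Neighborhood": "B", "Category": "Park"},
]

indian = extract_column(category_frequencies(venues), "Indian Restaurant")
print(indian)
```

In this sample, two of the four venues in neighborhood A are Indian restaurants, so its value is 0.5, while neighborhood B has none.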

Clustering the Neighborhoods:

We extract the Indian restaurant data from the above table and use it to find the best value of K.

From the above image, we see that the best value of K is 3 according to the elbow method.
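
The elbow method compares the within-cluster sum of squares (inertia) across several values of K and picks the point where further increases stop helping much. Below is a small pure-Python stand-in for that procedure on one-dimensional data. It is not the code used for the figure, which was most likely produced with a library such as scikit-learn. The function name, the deterministic initialization, and the sample values are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, labels, inertia)."""
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    # Deterministic start: spread the centroids over the sorted distinct values.
    if k == 1:
        centroids = [distinct[0]]
    else:
        last = len(distinct) - 1
        centroids = [distinct[(i * last) // (k - 1)] for i in range(k)]

    def assign(points, centers):
        # Index of the nearest centroid for every point.
        return [min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
                for p in points]

    for _ in range(iterations):
        labels = assign(values, centroids)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated

    labels = assign(values, centroids)
    inertia = sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))
    return centroids, labels, inertia


# Hypothetical mean frequencies of Indian restaurants per neighborhood.
values = [0.0, 0.01, 0.02, 0.10, 0.11, 0.12, 0.30]

# Inertia for K = 1..5; the "elbow" is where the drop levels off.
inertias = {k: kmeans_1d(values, k)[2] for k in range(1, 6)}
centroids3, labels3, inertia3 = kmeans_1d(values, 3)
for k, inertia in inertias.items():
    print(k, round(inertia, 5))
```

Plotting `inertias` against K gives the kind of curve the elbow method reads. In this made-up sample the drop flattens after K = 3.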

We will merge the above table with our New York data frame so that we get the coordinates of all neighborhoods.

We can see these 3 clusters on the map using the Folium library.
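
The merge and map steps could be sketched as follows. The row layout, the function names, and the sample neighborhoods are hypothetical, and the colors simply follow the cluster description in this article. `folium` is imported lazily inside `build_map`, so the merge can be tried on its own.

```python
# Colors follow the cluster description in the article; the mapping is an assumption.
CLUSTER_COLORS = {0: "red", 1: "purple", 2: "lightgreen"}


def merge_clusters(neighborhoods, labels_by_name):
    """Attach a cluster label and color to every neighborhood that has a label."""
    merged = []
    for row in neighborhoods:
        label = labels_by_name.get(row["Neighborhood"])
        if label is None:
            continue  # no cluster label is available for this neighborhood
        merged.append({**row, "Cluster": label, "Color": CLUSTER_COLORS[label]})
    return merged


def build_map(merged, center, zoom_start=11):
    """Draw one circle marker per neighborhood; requires the folium package."""
    import folium  # imported lazily so merge_clusters works without folium

    cluster_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for row in merged:
        folium.CircleMarker(
            location=[row["Latitude"], row["Longitude"]],
            radius=5,
            popup=f"{row['Neighborhood']} (cluster {row['Cluster']})",
            color=row["Color"],
            fill=True,
            fill_color=row["Color"],
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map


# Hypothetical rows, only to show the shape of the data.
neighborhoods = [
    {"Neighborhood": "A", "Latitude": 40.70, "Longitude": -73.80},
    {"Neighborhood": "B", "Latitude": 40.75, "Longitude": -73.85},
    {"Neighborhood": "C", "Latitude": 40.72, "Longitude": -73.90},
]
merged = merge_clusters(neighborhoods, {"A": 0, "B": 2})
```

Calling `build_map(merged, center=(lat, lon))` with the Queens coordinates would return a map object that can be displayed in a notebook or saved with its `save` method.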

Let’s Examine the Clusters:

Here we have 3 clusters: 0, 1, and 2. Cluster 0 contains the neighborhoods with the fewest Indian restaurants.

Cluster 0 is shown in red on the map.

Cluster 1 contains the neighborhoods with a high density of Indian restaurants. In this dataset, only one neighborhood falls into this cluster. Cluster 1 is shown in purple on the map.

Cluster 2 contains the neighborhoods with a medium density of Indian restaurants. Cluster 2 is shown in light green on the map.

Visualization:

New York City has 5 boroughs, and Queens has the highest number of neighborhoods among them.

Here we see that Queens has the highest number of neighborhoods in New York City.

Next, we look at which neighborhood has the highest number of Indian restaurants.

Here we see that Bayside has the maximum number of Indian restaurants.

In the above image, we see that Bayside has the highest number of Indian restaurants.
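
The counts behind charts like these can be produced with a simple tally. The sketch below counts Indian restaurants per neighborhood and picks the top one. The helper name and the sample rows are hypothetical and do not reproduce the real Foursquare results.

```python
from collections import Counter


def count_by(rows, key, where=None):
    """Count rows per value of `key`, optionally keeping only rows that pass `where`."""
    return Counter(row[key] for row in rows if where is None or where(row))


# Hypothetical venue rows, only to show the calculation.
venues = [
    {"Neighborhood": "A", "Category": "Indian Restaurant"},
    {"Neighborhood": "A", "Category": "Indian Restaurant"},
    {"Neighborhood": "A", "Category": "Park"},
    {"Neighborhood": "B", "Category": "Indian Restaurant"},
    {"Neighborhood": "B", "Category": "Pizza Place"},
]

indian_counts = count_by(
    venues, "Neighborhood",
    where=lambda row: row["Category"] == "Indian Restaurant",
)
top_neighborhood, top_count = indian_counts.most_common(1)[0]
print(top_neighborhood, top_count)
```

The same helper with `key="Borough"` and no filter would give the number of neighborhoods per borough from the New York City table.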

Result

The results of the exploratory data analysis and clustering are summarized below:

  1. The Bayside neighborhood has the highest number of Indian restaurants.
  2. The Jamaica Estates neighborhood has a high density of Indian restaurants.
  3. Cluster 0 neighborhoods have the fewest Indian restaurants.
  4. I will open my restaurant in the South Ozone Park neighborhood because it is near the international airport. Newly arrived immigrants are likely to stop at the nearest restaurant, so the profit should be higher.

Discussion

According to the analysis, South Ozone Park offers the least competition for a new Indian restaurant, and the international airport is close to this neighborhood. This makes it a convenient place for Indian immigrants to have lunch or dinner, and the frequency of Indian restaurants there is very low compared to other neighborhoods.
Bayside has the highest number of Indian restaurants and Jamaica Estates has a high density of them, so we will not open there.
The analysis has some drawbacks. The clustering is based entirely on data provided by the Foursquare API. Land prices, the distance of venues from the closest station, and the number of potential customers could all play a major role, so this analysis is far from conclusive. However, it does give us some important preliminary information on the possibilities of opening restaurants in the Queens borough of New York City.
Another pitfall is that the analysis considers only one borough of New York City; taking into account all the areas under the 5 boroughs would give us an even more realistic picture. Furthermore, the results could vary if we used another clustering technique such as DBSCAN.

Conclusion

Finally, to conclude, this project gave us a small glimpse of what a real-life data science project looks like. I used some frequently used Python libraries to handle JSON files, plot graphs, and carry out other exploratory data analysis, and I used the Foursquare API to explore the boroughs of New York City and their neighborhoods. The potential of this kind of analysis for a real-life business problem was discussed, along with some of its drawbacks and opportunities for improvement that would give an even more realistic picture. As a final note, all of the above analysis depends on the adequacy and accuracy of Foursquare data. A more comprehensive analysis and future work would need to incorporate data from other external databases.
