Knowing the floor number of every unit in a building matters, which is why we developed Machine Learning models that extract floor numbers from apartment names and show them on the HelloData platform. If you want to analyze prices and market trends floor by floor alongside your own data, the floor number is also available through our API, in each unit's details.
Why Add Floor Detection As a Feature?
It started with an experiment. We detected tens of thousands of floor numbers from apartment names in Chicago, then analyzed how prices vary with floor number. The results are quite interesting: after controlling for other variables, an extra floor (story) in Chicago is worth $0.0126/sqft on average.
Consider two average apartment units with the same overall layout: 2 beds and 1 bath. One is on the 5th floor and the other is on the 22nd floor. Both are 964 sqft (the average for a 2BR/1BA in our database). That leads to a difference of $213 on average between these two apartments. Given that the average effective rent for that layout is $1,831/mo, that represents more than 10% in price difference. We wanted our users to have access to that valuable information.
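As a rough check of that comparison, here is a minimal sketch that assumes the premium is linear in both floor count and square footage (the function and constant names are ours, for illustration only). With the rounded $0.0126 coefficient it lands slightly below the $213 quoted above, which presumably reflects rounding in the published coefficient.

```python
# Figures quoted in the article.
PREMIUM_PER_FLOOR_PER_SQFT = 0.0126  # $/sqft for each extra floor (Chicago)
UNIT_SQFT = 964                      # average 2BR/1BA size
AVG_EFFECTIVE_RENT = 1831            # $/mo for that layout


def floor_premium(low_floor, high_floor,
                  sqft=UNIT_SQFT, rate=PREMIUM_PER_FLOOR_PER_SQFT):
    """Price difference implied by a linear per-floor, per-sqft premium.

    Assumption: the premium scales linearly with floors and with sqft.
    """
    return (high_floor - low_floor) * sqft * rate


difference = floor_premium(5, 22)
share_of_rent = difference / AVG_EFFECTIVE_RENT
print(f"${difference:.0f} difference, {share_of_rent:.1%} of average rent")
```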
Our Approach
Let's use a 22-story building in New York as an example. The unit names below come from the subject property we will analyze.
01-1618
01-314
02-629
01-1704
01-1204
01-604
01-1615
01-1424
01-1825
02-2025
01-1506
Real estate professionals would likely recognize that the first number (01- or 02-) indicates a building or internal reference. The last two digits look like potential floor numbers, but some, as in '01-1825', exceed the building's 22 stories. This leads us to intuitively suspect that the second number (the one or two digits right after the dash) is the actual floor. We wanted our solution to be able to do exactly that.
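That intuition can be written down as a simple range check. The sketch below is only an illustration of the reasoning, not our production model: it assumes every name looks like "prefix-digits" with at least three digits after the dash, and the helper names are hypothetical.

```python
UNIT_NAMES = [
    "01-1618", "01-314", "02-629", "01-1704", "01-1204", "01-604",
    "01-1615", "01-1424", "01-1825", "02-2025", "01-1506",
]


def split_unit(name):
    """Split '01-1825' into (prefix, leading digits, trailing two digits)."""
    prefix, _, number = name.partition("-")
    return prefix, int(number[:-2]), int(number[-2:])


def plausible_floor_fields(names, stories):
    """Return the fields whose value stays within 1..stories for every unit."""
    fields = {"leading": [], "trailing": []}
    for name in names:
        _, leading, trailing = split_unit(name)
        fields["leading"].append(leading)
        fields["trailing"].append(trailing)
    return [
        key for key, values in fields.items()
        if all(1 <= value <= stories for value in values)
    ]


print(plausible_floor_fields(UNIT_NAMES, stories=22))
```

Only the leading digits survive the check for a 22-story building, because trailing values such as 24, 25, and 29 are taller than the building itself.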
Training our Machine Learning Models
Looking at a small enough subset of the unit names above, we could easily be misled and detect the wrong pattern. We needed to design a model that was able to look at the full context, such as the number of stories and the number of units, to understand the type of asset we are looking at. Garden-style buildings, mid-rises, and high-rises all follow different patterns in their unit names, and the model needed to understand that.
Statistical Approach: Our very first approach did not use machine learning at all. We hand-coded an algorithm that estimated the number of units per floor, then looked at blocks of characters in unit names to see how much they vary across all units. We then picked the most statistically likely part of the name that was compatible with the property's characteristics. That approach worked well on uniform unit patterns. But when a property mixed different patterns, it became complex to extract statistical signals without even knowing how many different patterns were in use.
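To give a feel for that kind of hand-coded heuristic, here is a heavily simplified sketch. It is not the algorithm we shipped: the candidate fields, the "last two digits are the unit on its floor" convention, and the scoring rule (prefer the in-range field whose number of distinct values is closest to the number of floors we could expect to see) are all assumptions made for illustration.

```python
import re

UNIT_NAMES = [
    "01-1618", "01-314", "02-629", "01-1704", "01-1204", "01-604",
    "01-1615", "01-1424", "01-1825", "02-2025", "01-1506",
]


def candidate_fields(name):
    """Yield (label, value) for each digit block and for splits of long blocks."""
    for i, block in enumerate(re.findall(r"\d+", name)):
        yield f"block{i}", int(block)
        if len(block) >= 3:
            # Assumed convention: the last two digits number the unit on its floor.
            yield f"block{i}[:-2]", int(block[:-2])
            yield f"block{i}[-2:]", int(block[-2:])


def pick_floor_field(names, stories):
    """Pick the field that fits the building height and varies like floors do."""
    values = {}
    for name in names:
        for label, value in candidate_fields(name):
            values.setdefault(label, []).append(value)

    target = min(stories, len(names))  # distinct floors we can expect to see
    best_label, best_gap = None, None
    for label, field_values in values.items():
        if len(field_values) != len(names):
            continue  # the field must exist in every unit name
        if not all(1 <= v <= stories for v in field_values):
            continue  # incompatible with the number of stories
        gap = abs(len(set(field_values)) - target)
        if best_gap is None or gap < best_gap:
            best_label, best_gap = label, gap
    return best_label


print(pick_floor_field(UNIT_NAMES, stories=22))
```

On a uniform pattern like this one the heuristic is enough. As soon as names in the same property follow different conventions, the "field must exist in every unit name" assumption breaks, which is the limitation described above.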
First Machine Learning Attempt: Large Language Models are great at understanding context, so that was our first attempt. We fine-tuned a task-specific model on a dataset we built of 21k buildings and their unit names, and it detected the floor number with 94.8% accuracy. That's good, but it was too slow to be used in production. After running the numbers, we realized that it would cost us a fortune to run it on all the unit names we have. As of today, HelloData’s database has 2.6M properties across the United States. We get new data daily, so we would need to recompute tens of millions of units every day. It would have cost us about $10k/day to run this model. We are all about being cost-effective, and simply not willing to throw $3.6M/year at that, so we kept looking for a better solution.
Final Model: We decided to design a smaller, proprietary model with an architecture similar to the one Google uses for translation, and trained it on a dataset of 200,000 buildings, each with between 1 and 50 different unit names. We threw a lot of complicated unit names at it, and generated additional training data by adding noise, inserting random characters, and inverting parts of the names. It detected the floor number with 98.2% accuracy, and it was 1000x faster than the LLM.
Most “AI companies” these days use OpenAI or Mistral AI behind the scenes. By focusing on speed and refusing to take on a big new cost, we developed a super-efficient and more accurate model that could easily be integrated into our existing data pipeline.
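The augmentation step could look something like the sketch below. It is a hypothetical illustration rather than our training code: the three perturbations, their names, and the reading of "inverting parts" as swapping the pieces around the dash are assumptions. Each perturbation leaves the digits intact, so the floor label attached to the original name stays valid.

```python
import random
import string


def add_noise(name, rng):
    """Attach a random letter to the start or end, leaving the digits untouched."""
    letter = rng.choice(string.ascii_uppercase)
    return letter + name if rng.random() < 0.5 else name + letter


def swap_parts(name, rng):
    """Swap the parts around the first dash, e.g. '01-1825' -> '1825-01'."""
    left, sep, right = name.partition("-")
    return f"{right}-{left}" if sep else name


def change_separator(name, rng):
    """Replace dashes with another separator."""
    return name.replace("-", rng.choice([" ", "/", "_", "."]))


AUGMENTATIONS = [add_noise, swap_parts, change_separator]


def augment(name, rng, n_variants=3):
    """Return n_variants perturbed copies of a unit name."""
    return [rng.choice(AUGMENTATIONS)(name, rng) for _ in range(n_variants)]


rng = random.Random(0)  # seeded so the generated variants are reproducible
for variant in augment("01-1825", rng):
    print(variant)
```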
Difficulties
On that path, we had to deal with several difficulties:
Inconsistent Patterns: From what we see on property websites, people are not always consistent in the way they name their units, so we had to handle naming inconsistencies. You know that unit that's called “Apt 1005” when others follow a different pattern like “10-5”? Real estate data is anything but uniform, which makes seemingly easy problems very challenging to solve. Developing an AI model instead of a more rigid algorithm solved most of that problem, and we made sure our training dataset was diverse enough to cover all these cases.
Incorrect Floor Number: Whether it's a typo or someone who changed the floor information to steer the revenue management system toward a different price, floor data can easily be misleading. We did a lot of data cleaning and manual checks to make sure we were not training our model on incorrect data.
Need for Number of Stories: We wanted to use the number of stories in the building as a feature for our model, to make sure that we wouldn't detect a unit on the 40th floor in a garden-style community. But that piece of information is not available for all buildings, so we built a computer vision model to detect the number of stories from the building's pictures. Given a facade, we can estimate the bucket of stories the building falls in, and substitute that for the missing feature in our model.
What is the value of an extra floor in X?
Not all markets support the same value for an extra floor. Using all our detected floors, we have computed that value for several markets:
Average value of an extra floor in Chicago, IL: $0.0126/sqft
Average value of an extra floor in Dallas, TX: $0.0118/sqft
Average value of an extra floor in Atlanta, GA: $0.0106/sqft
Average value of an extra floor in New York, NY: $0.0101/sqft
Average value of an extra floor in Miami, FL: $0.0072/sqft
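To turn these per-sqft figures into dollars, here is a small sketch that applies each market's value to the same hypothetical comparison: a 1,000 sqft unit sitting 10 floors higher than an otherwise identical one. The unit size, the floor gap, and the assumption that the premium is linear are ours, chosen only to make the numbers easy to read.

```python
# Per-floor values quoted in the list above, in $/sqft.
PREMIUM_PER_FLOOR = {
    "Chicago, IL": 0.0126,
    "Dallas, TX": 0.0118,
    "Atlanta, GA": 0.0106,
    "New York, NY": 0.0101,
    "Miami, FL": 0.0072,
}


def premium_by_market(sqft, floors_higher):
    """Dollar premium per market, assuming it is linear in sqft and floors."""
    return {
        market: rate * sqft * floors_higher
        for market, rate in PREMIUM_PER_FLOOR.items()
    }


for market, dollars in premium_by_market(sqft=1000, floors_higher=10).items():
    print(f"{market}: ${dollars:.0f}")
```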
What's next?
So much! We are determined to know and show you all the important variables that you would use when pricing units and understanding your markets, whether it's on our platform or through our API.
Corner Unit Detection: we actually started HelloData with a product called Floorplan.ai. We are now revamping it to detect corner units from floorplans.
Renovated Unit Detection: renovations drive a lot of value in real estate. We are working on detecting that from unit names and pricing patterns.
Floor-aware Price Recommendation: our platform (and API) offers price recommendations. We are working on incorporating the floor number into that recommendation.
Data Scientist Nicolas Lassaux, with expertise in real estate analytics, was pivotal at Enodo and Walker & Dunlop. Co-founder of HelloData, he's elevating real estate decisions through innovative data use. He is passionate about running, cycling, and music.