A modest beginning
The data we collected was stored in MongoDB, where complex relations live as nested objects, a structure that easily drifts into inconsistency. To normalize the data, we built a pipeline that transfers it into a relational database, specifically Snowflake.
Using Python and DVC, we processed this raw information and exported it as CSV files — several of them, each representing a slice of the larger puzzle.
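To make the extraction step concrete, here is a minimal sketch of the kind of flattening involved; the connection string, database, collection, and file names are placeholders rather than our actual setup:

```python
# Sketch: flatten nested MongoDB documents into a flat table for relational loading.
# The connection string, database, and collection names are placeholders.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
documents = client["jobs_db"]["postings"].find({}).limit(1000)

# json_normalize expands nested objects into dotted column names,
# e.g. {"location": {"city": "NYC"}} -> a "location.city" column.
df = pd.json_normalize(list(documents))

# Export to CSV; Snowflake can then ingest the file via its loading tools.
df.to_csv("postings_flat.csv", index=False)
```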
For analysis, we built a Python pipeline on top of these exports that cleans and standardizes values; DVC handles data version control and pipeline orchestration.
At this point, our raw dataset suffered from duplicated columns, fragmentation, and inconsistent manual input. For instance, because the data was collected over a long period, country names appeared sometimes as official names and sometimes as country codes, so the pipeline unified them into a single representation across the whole dataset.
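To give a flavor of that unification, here is a simplified sketch using pycountry; the column name and the resolution logic are illustrative rather than our production code:

```python
# Sketch: unify country values that appear as official names or ISO codes.
# The DataFrame and column name are illustrative.
import pandas as pd
import pycountry

def canonical_country(value: str) -> str | None:
    """Map 'United States of America', 'US', or 'USA' to one canonical name."""
    if not isinstance(value, str) or not value.strip():
        return None
    try:
        # lookup() matches common names, official names, and alpha-2/alpha-3 codes.
        return pycountry.countries.lookup(value.strip()).name
    except LookupError:
        return None  # left empty for manual review

df = pd.DataFrame({"country": ["US", "United States of America", "DEU", "Germany"]})
df["country_normalized"] = df["country"].map(canonical_country)
print(df)
```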
The puzzle was becoming clearer, but the initial export still contained inconsistencies, missing values, and redundancies that threatened the reliability of our analysis.
The big cleanup
Cleaning a large dataset feels like untangling a string of Xmas lights.
But if you divide it into a few stages, it becomes easier.
Standardizing parameters
Imagine this: “New York City” appeared in at least five forms, including “NYC,” “New York,” “N.Y.C.,” “New York, NY,” and “Big Apple.” With no validation on the city field, misspellings and inconsistent capitalization left a single city with dozens, sometimes hundreds, of written variants.
Identifying cities reliably in such a situation is hard, so we ran the field through a system that combined conventional data cleaning, search algorithms, and an LLM.
We began with GeoNames, a geographical database that became our secret weapon. Searching it with the Levenshtein distance algorithm, we automated name matching for approximately 90% of the dataset.
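Here is a minimal sketch of the fuzzy-matching idea; a small hardcoded list stands in for the full GeoNames export, and the similarity threshold is an arbitrary starting point:

```python
# Sketch: match messy city strings to canonical GeoNames names via Levenshtein distance.
# The canonical list and the 0.85 threshold are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def best_match(raw: str, canonical: list[str], threshold: float = 0.85) -> str | None:
    """Return the closest canonical name if its similarity clears the threshold."""
    raw_norm = raw.strip().lower()
    best, best_score = None, 0.0
    for name in canonical:
        distance = levenshtein(raw_norm, name.lower())
        score = 1 - distance / max(len(raw_norm), len(name), 1)
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= threshold else None

canonical_cities = ["New York City", "Newark", "San Francisco"]
print(best_match("new york citty", canonical_cities))  # typo still resolves to "New York City"
```

Typos and small formatting differences clear the threshold, but abbreviations like “NYC” or nicknames like “Big Apple” do not, which is exactly where the next step comes in.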
For the remaining 10%, we used an LLM-based step to resolve the names that fuzzy matching could not.
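We cannot reproduce the exact prompt here, but the general shape of that fallback looks roughly like the sketch below; the OpenAI client and the model name are assumptions for illustration, not necessarily what runs in our pipeline:

```python
# Sketch: resolve city strings that fuzzy matching could not handle ("NYC", "Big Apple", ...).
# The model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def resolve_city(raw_name: str) -> str:
    prompt = (
        "Return the canonical English city name for the following string, "
        "with no extra words. If it is not a city, return 'UNKNOWN'.\n"
        f"String: {raw_name}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(resolve_city("Big Apple"))  # expected: "New York City"
```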
With standardized locations in hand, we enriched the data with demographic and economic statistics for cities and countries. We added GDP per capita and population figures, which gave us numeric parameters for further analysis.
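Mechanically, the enrichment is a join on the standardized location keys. A minimal sketch with placeholder numbers (the stats table and its values are not real figures):

```python
# Sketch: join standardized records with country-level statistics.
# The statistic values below are placeholders, not real data.
import pandas as pd

records = pd.DataFrame({
    "city": ["New York City", "Berlin"],
    "country": ["United States", "Germany"],
})

country_stats = pd.DataFrame({
    "country": ["United States", "Germany"],
    "gdp_per_capita_usd": [80_000, 52_000],   # placeholder values
    "population_millions": [333, 84],         # placeholder values
})

enriched = records.merge(country_stats, on="country", how="left")
print(enriched)
```

The same pattern applies at the city level once city names are standardized.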
Columns in crisis
Empty columns. Redundant columns. Columns that added nothing to the story.
We tackled them all (each step is sketched after the list):
- Blank fields were replaced with appropriate values or marked for exclusion.
- Redundant fields were consolidated, ensuring no data point was repeated unnecessarily.
- Complex lists of skills were reformatted into clean, searchable formats.
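In code, those three steps look roughly like this; the column names, fill values, and delimiters are illustrative, not our actual schema:

```python
# Sketch: the three cleanup steps above, in pandas.
# Column names, fill values, and delimiters are illustrative.
import pandas as pd

df = pd.DataFrame({
    "title": ["Data Engineer", "Data Engineer", None],
    "seniority": [None, "Senior", "Junior"],
    "seniority_level": [None, "Senior", "Junior"],  # redundant copy of "seniority"
    "skills": ["Python; SQL ; dbt", "python,sql", None],
})

# 1. Blank fields: fill where a sensible default exists, otherwise flag for exclusion.
df["seniority"] = df["seniority"].fillna("unspecified")
df["exclude"] = df["title"].isna()

# 2. Redundant fields: keep one canonical column, drop the duplicate.
df = df.drop(columns=["seniority_level"])

# 3. Skill lists: split on common delimiters, trim, lowercase, deduplicate.
df["skills"] = (
    df["skills"]
    .fillna("")
    .str.replace(";", ",")
    .str.split(",")
    .apply(lambda parts: sorted({p.strip().lower() for p in parts if p.strip()}))
)

print(df)
```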
Step by step, the dataset shed its excess baggage, emerging lighter and more insightful.
The final test: Staying relevant
A clean dataset isn’t always a useful one. So, before declaring victory, we combed through each parameter, asking: Does this add value?
Some fields, while interesting, lacked utility and were removed. Others, previously overlooked, were given their rightful place in the spotlight.
This final review ensured that every column serves a purpose, even if some of those purposes are yet to be discovered.
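There is no single metric for “adds value,” but simple column-level statistics told us where to look first. A rough sketch, with arbitrary thresholds:

```python
# Sketch: flag columns that are unlikely to add value (mostly empty or constant).
# The 0.95 missing-ratio threshold is an arbitrary starting point, not a rule.
import pandas as pd

def audit_columns(df: pd.DataFrame, max_missing: float = 0.95) -> pd.DataFrame:
    report = pd.DataFrame({
        "missing_ratio": df.isna().mean(),
        "unique_values": df.nunique(dropna=True),
    })
    report["drop_candidate"] = (
        (report["missing_ratio"] > max_missing) | (report["unique_values"] <= 1)
    )
    return report.sort_values("missing_ratio", ascending=False)

# audit_columns(dataset) returns one row per column; flagged columns still get
# a human look before anything is dropped.
```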
The foundation for what's next
This stage of the journey, though significant, is just the beginning. Ahead lies the exciting work of exploratory data analysis (EDA), uncovering patterns, and building predictive models.
But much like constructing a building, success depends on a strong foundation.
Preparing the dataset — cleaning, standardizing, and aligning it — wasn’t just a necessary step; it was the blueprint for everything that follows. By investing time here, we’ve ensured that what comes next will be rooted in reliability and precision.
As we dive deeper into this dataset, we’ll share the insights we uncover — stories hidden in the data, market trends that emerge, and lessons we learn along the way.
Stay tuned for our updates, and subscribe to our newsletter to follow this journey.
The big discoveries are still ahead!