As the international tech giant moves toward Russian ownership, the leak raises concerns about the volume of data it has on its users.
If you live in Russia, there’s no avoiding Yandex. The tech giant—often referred to as “Russia’s Google”—is part of daily life for millions of people. It dominates online search, ride-hailing, and music streaming, while its maps, payment, email, and scores of other services are popular. But as with all tech giants, there’s a downside of Yandex being everywhere: It can gobble up huge amounts of data.
In January, Yandex suffered the unthinkable. It became the latest in a short list of high-profile firms to have its source code leaked. An anonymous user of the hacking site BreachForums publicly shared a downloadable 45-gigabyte cache of Yandex’s code. The trove, which is said to have come from a disgruntled employee, doesn’t include any user data but provides an unparalleled view into the operation of its apps and services. Yandex’s search engine, maps, AI voice assistant, taxi service, email app, and cloud services were all laid bare.
The leak also included code from two of Yandex’s key systems: its web analytics service, which captures details about how people browse, and its powerful behavioral analytics tool, which helps run its ad service that makes millions of dollars. This kind of advertising system underpins much of the modern web’s economy, with Google, Facebook, and thousands of advertisers relying on similar technologies. But the systems are largely black boxes.
Now, an in-depth analysis of the source code belonging to these two services, by Kaileigh McCrea, a privacy engineer at cybersecurity firm Confiant, is shedding light on how the systems work. Yandex’s technologies collect huge volumes of data about people, and this can be used to reveal their interests when it is “matched and analyzed” with all of the information the company holds, Confiant’s findings say.
McCrea says the Yandex code shows how the company creates household profiles for people who live together and predicts people’s specific interests. From a privacy perspective, she says, what she found is “deeply unsettling.” “There are a lot of creepy layers to this onion,” she says. The findings also reveal that Yandex has one technology in place to share some limited information with Rostelecom, the Russian-government-backed telecoms company.
Yandex’s chief privacy officer, Ivan Cherevko, in detailed written answers to WIRED’s questions, says the “fragments of code” are outdated, are different from the versions currently used, and that some of the source code was “never actually used” in its operations. “Yandex uses user data only to create new services and improve existing ones,” and it “never sells user data or discloses data to third parties without user consent,” he says.
However, the analysis comes as Russia’s tech giant is going through significant changes. Following Russia’s full-scale invasion of Ukraine in February 2022, Yandex is splitting its parent company, based in the Netherlands, from its Russian operations. Analysts believe the move could see Yandex in Russia become more closely connected to the Kremlin, with data being put at risk.
“They have been trying to maintain this image of a more independent and Western-oriented company that from time to time protested some repressive laws and orders, helping attract foreign investments and business deals,” says Natalia Krapiva, tech-legal counsel at digital rights nonprofit Access Now. “But in practice, Yandex has been losing its independence and caving in to the Russian government demands. The future of the company is uncertain, but it’s likely that the Russia-based part of the company will lose the remaining shreds of independence.”
The Yandex leak is huge. The 45 GB of source code covers almost all of Yandex’s major services, offering a glimpse into the work of its thousands of software engineers. The code appears to date from around July 2022, according to timestamps included within the data, and it mostly uses popular programming languages. It is written in English and Russian, but also includes racist slurs. (When it was leaked in January, Yandex said this was “deeply offensive and completely unacceptable,” and it detailed some ways that parts of the code broke its own company policies.)
McCrea manually inspected two parts of the code: Yandex Metrica and Crypta. Metrica is the firm’s equivalent of Google Analytics, software that places code on participating websites and in apps, through AppMetrica, that can track visitors, including down to every mouse movement. Last year, AppMetrica, which is embedded in more than 40,000 apps in 50 countries, caused national security concerns with US lawmakers after the Financial Times reported the scale of data it was sending back to Russia.
This data, McCrea says, is pulled into Crypta. The tool analyzes people’s online behavior to ultimately show them ads for things they’re interested in. More than 300 “factors” are analyzed, according to the company’s website, and machine learning algorithms group people based on their interests. “Every app or service that Yandex has, which is supposed to be over 90, is funneling data into Crypta for these advertising segments in one form or another,” McCrea says.
Some data collected by Yandex is handed over when people use its services, such as sharing their location to show where they are on a map. Other information is gathered automatically. Broadly, the company can gather information about someone’s device, location, search history, home location, work location, music listening and movie viewing history, email data, and more.
The source code shows AppMetrica collecting data on people’s precise location, including their altitude, direction, and the speed they may be traveling. McCrea questions how useful this is for advertising. It also grabs the names of the Wi-Fi networks people are connecting to. This is fed into Crypta, with the Wi-Fi network name being linked to a person’s overall Yandex ID, the researcher says. At times, its systems attempt to link multiple different IDs together.
“The amount of data that Yandex has through the Metrica is so huge, it’s just impossible to even imagine it,” says Grigory Bakunov, a former Yandex engineer and deputy CTO who left the company in 2019. “It’s enough to build any grouping, or segmentation of the audience.” The segments created by Crypta appear to be highly specific and show how powerful data about our online lives is when it is aggregated. There are advertising segments for people who use Yandex’s Alice smart speaker, “film lovers” can be grouped by their favorite genre, there are laptop users, people who “searched Radisson on maps,” and mobile gamers who show a long-term interest.
McCrea says some categories stand out more than others. She says a “smokers” segment appears to track people who purchase smoking-related items, like e-cigarettes. A “summer residents” segment may indicate people who have holiday homes, and it uses location data to determine this. There is also a “travelers” section that can use location data to track whether people have traveled from their normal location to another; it includes international and domestic fields. One part of the code looked to pull data from the Mail app and included fields about “boarding passes” and “hotels.”
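To make the “travelers” idea concrete, here is a minimal sketch of how a location-based segment of that kind could be computed. It is an illustration only, not a reconstruction of the leaked code: the function names, the dictionary fields, and the 100-kilometer threshold are all assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def classify_trip(home, current, threshold_km=100.0):
    """Label a location relative to a person's usual one.

    `home` and `current` are hypothetical dicts with `lat`, `lon`,
    and `country` keys; `threshold_km` is an arbitrary cutoff.
    """
    if home["country"] != current["country"]:
        return "international"
    distance = haversine_km(home["lat"], home["lon"],
                            current["lat"], current["lon"])
    return "domestic" if distance >= threshold_km else "at_home"


moscow = {"lat": 55.7558, "lon": 37.6173, "country": "RU"}
st_petersburg = {"lat": 59.9343, "lon": 30.3351, "country": "RU"}
istanbul = {"lat": 41.0082, "lon": 28.9784, "country": "TR"}

labels = [classify_trip(moscow, place)
          for place in (moscow, st_petersburg, istanbul)]
```

Even a rule this simple needs only a usual location and a current one, which is part of why precise location data is so revealing once it is aggregated.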
Some of this information “doesn’t sound that unusual” for online advertising, McCrea says. But the big question for her is whether creating personalized advertising is a good enough reason to collect “this invasive level of information.” Behavioral advertising has long followed people around the web, with companies hoovering up people’s data in creepy ways. Regulators have failed to get a grip on the issue, while others have suggested it should be banned. “When you think about what else you could do, if you can make that kind of calculation, it’s kind of creepy, especially in Russia,” McCrea says. She suggests it is not implausible to create segments for military-aged men who are looking to leave Russia.
Yandex’s Cherevko says that grouping users by interests is an “industry standard practice” and that it isn’t possible for advertisers to identify specific people. Cherevko says the collection of information allows people to be shown specific ads: “gardening products to a segment of users who are interested in summer houses and car equipment to those who visit gas stations.” Crypta analyzes a person’s online behavior, Cherevko says, and “calculates the probability” they belong to a specific group.
“For Crypta, each user is represented as a set of identifiers, and the system cannot associate them with a natural person in the real world,” Cherevko claims. “This kind of set is probabilistic only.” He adds that Crypta doesn’t have access to people’s emails and says the Mail data in the code about boarding passes and hotels was an “experiment.” Crypta “received only de-identified information about the category from Mail,” and the method has not been used since 2019, Cherevko says. He adds that Yandex deletes “user geolocation” collected by AppMetrica after 14 days.
While the leaked source code offers a detailed view of how Yandex’s systems may operate, it is not the full picture. Artur Hachuyan, a data scientist and AI researcher in Russia who started his own firm doing analytics similar to Crypta, says that when he inspected the code, he did not find any pretrained machine learning models, nor any references to data sources or external databases of Yandex’s partners. It’s also not clear, for instance, which parts of the code were not used.
McCrea’s analysis says Yandex assigns people household IDs. Details in the code, the researcher says, include the number of people in a household, the gender of people, and whether there are any elderly people or children. People’s location data is used to group them into households, and they can be included if their IP addresses have “intersected,” Cherevko says. The groupings are used for advertising, he says. “If we assume that there are elderly people in the household, then we can invite advertisers to show them residential complexes with an accessible environment.”
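As a rough illustration of what grouping by “intersected” IP addresses could look like, the sketch below puts user IDs into the same household whenever they have been seen on a common IP address. This is a generic union-find approach under stated assumptions, not Yandex’s method; the function name, the input format, and the sample IDs and addresses are hypothetical.

```python
from collections import defaultdict


def group_households(sightings):
    """Group user IDs that share at least one IP address.

    `sightings` is a hypothetical list of (user_id, ip) pairs.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(user_id):
        # Follow parent links to the group's representative.
        parent.setdefault(user_id, user_id)
        while parent[user_id] != user_id:
            parent[user_id] = parent[parent[user_id]]  # path halving
            user_id = parent[user_id]
        return user_id

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_user_on_ip = {}
    for user_id, ip in sightings:
        find(user_id)  # register the user even if the IP is unique
        if ip in first_user_on_ip:
            union(first_user_on_ip[ip], user_id)
        else:
            first_user_on_ip[ip] = user_id

    groups = defaultdict(set)
    for user_id in parent:
        groups[find(user_id)].add(user_id)
    return sorted(sorted(members) for members in groups.values())


sightings = [
    ("user_a", "203.0.113.7"),
    ("user_b", "203.0.113.7"),
    ("user_b", "198.51.100.4"),
    ("user_c", "198.51.100.4"),
    ("user_d", "192.0.2.55"),
]
households = group_households(sightings)
```

Here `user_a` and `user_c` never share an address directly but still end up together because both overlap with `user_b`. Sharing an IP address does not prove people live together, since offices, cafés, and mobile carriers also put strangers behind one address, so any grouping of this kind is a guess rather than a fact.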
The code also shows how Yandex can combine data from multiple services. McCrea says in one complex process, an adult’s search data may be pulled from the Yandex search tool, AppMetrica, and the company’s taxi app to predict whether they have children in their household. Some of the code categorizes whether children may be over or under 13. (Yandex’s Cherevko says people can order taxis with children’s seats, which is a sign they may be “interested in specific content that might be interesting for someone with a child.”)
One element within the Crypta code indicates just how all of this data can be pulled together. A user interface exists that acts as a profile about someone: It shows marital status, their predicted income, whether they have children, and three interests—which include broad topics such as appliances, food, clothes, and rest. Cherevko says this is an “internal Yandex tool” where employees can see how Crypta’s algorithms classify them, and they can only access their own information. “We have not encountered any incidents related to access abuse,” he says.
Yandex is going through a breakup. In November 2022, the company’s Netherlands-based parent organization, Yandex NV, announced it will separate itself from the Russian business, following Russia’s invasion of Ukraine. Internationally, the company, which will change its name, is planning to develop self-driving technologies and cloud computing, while divesting itself from search, advertising, and other services in Russia. Various Russian businessmen have been linked to the potential sale. (At the end of July, Yandex NV said it plans to propose its restructuring to shareholders later this year.)
While the uncoupling is being worked out, Russia has been trying to consolidate its control of the internet and increasing censorship. A slew of new laws requires more companies and government services in the country to use home-grown tech. For instance, this week, Finland and Norway’s data regulators blocked Yandex’s international taxi app from sending data back to Russia due to a new law, which comes into force in September, that will allow the Federal Security Service (FSB) access to taxi data.
These nationalization efforts coupled with the planned ownership change at Yandex are creating concerns that the Kremlin may soon be able to use data gathered by the company. Stanislav Shakirov, the CTO of Russian digital rights group Roskomsvoboda and founder of tech development organization Privacy Accelerator, says historically Yandex has tried to resist government demands for data and has proved better than other firms. (In June, it was fined 2 million rubles ($24,000) for not handing data to Russian security services.) However, Shakirov says he thinks things are changing. “I am inclined to believe that Yandex will be attempted to be nationalized and, as a consequence, management and policy will change,” Shakirov says. “And as a consequence, user data will be under much greater threat than it is now.”
Bakunov, the former Yandex engineer, who reviewed some of McCrea’s findings at WIRED’s request, says he is scared by the potential for the misuse of data going forward. He says it looks like Russia is a “new generation” of a “failed state,” highlighting how it may use technology. “Yandex here is the big part of these technologies,” he says. “When we built this company, many years ago, nobody thought that.” The company’s head of privacy, Cherevko, says that within the restructuring process, “control of the company will remain in the hands of management.” And its management makes decisions based on its “core principles.”
But the leaked code shows, in one small instance, that Yandex may already share limited information with one Russian government-linked company. Within Crypta are five “matchers” that sync fingerprinting events with telecoms firms—including the state-backed Rostelecom. McCrea says this indicates that the fingerprinting events could be accessible to parts of the Russian state. “The shocking thing is that it exists,” McCrea says. “There’s nothing terribly shocking within it.” (Cherevko says the tool is used for improving the quality of advertising, helping it to improve its accuracy, and also identifying scammers attempting to conduct fraud.)
Overall, McCrea says that whatever happens with the company, there are lessons about collecting too much data and what can happen to it over time when circumstances change. “Nothing stays harmless forever,” she says.