“Is this the real life? Is this just fantasy?” raps Kanye West, in what sounds like an a cappella cover of Freddie Mercury’s timeless opening lines. The performance is so convincing that you might be surprised to learn that it never really happened. Thanks to new AI tools, users can create vocal “deepfakes” of their favorite celebrities, the most popular example being a viral performance of Queen’s “Bohemian Rhapsody” in the disturbingly realistic “voice” of Kanye West.
It is incredibly easy to create a vocal performance using a famous rapper’s voice with the help of Uberduck, a popular AI-driven text-to-speech (TTS) synthesis engine. After logging in with a Gmail or Discord account, users can select from a drop-down menu of different categories (such as “rappers”) and then specify the individual voice within that category (such as the artist “Kanye West”). The user is then directed to either enter the text they wish to hear or select prewritten versions of sung and spoken snippets. After they instruct the system to “synthesize” all the information, the text is rendered audible, and the user has the chance to engage in further vocal processing, including changing the speed, pitch, and word length. Within minutes the performance is ready to be downloaded, overlaid on a TikTok video, and shared.
It’s also incredibly easy to view performances like “Kanye’s” as a novel, if trivial, tech flex. But I argue that the ease with which creators and developers have co-opted and presented the voices of notable Black hip-hop artists like Kanye West and JAY-Z actually represents the latest development in a broader history of what musicologist Matthew D. Morrison calls “Blacksound”: the “sonic and embodied legacy of blackface performance as the origin of all popular music, entertainment, and culture in the United States.” Long before the advent of AI, the racist theatrical and musical form known as blackface minstrelsy emerged in the 1820s and became America’s first national form of entertainment. Minstrel shows, which evolved and continued well into the 20th century through radio, film, and television, consisted primarily of traveling groups of white performers donning black face paint, then acting out caricatures of slaves and free Northern Black folks through song and dance. Minstrelsy has been widely understood as a reflection of white revulsion at and fears about Black people, but it is also, importantly, a demonstration of white desires to transgress perceived race, class, and gender norms through perverse racial performances: a dialectic that has been framed as “love and theft” by Eric Lott and more recently as “terror and enjoyment” by Saidiya Hartman.
In recent years, terms like “high-tech blackface” and “digital blackface” have gained currency, as scholars on race and media have begun to theorize how this dialectic shows up in unique ways in the technologies of the digital age, enabling non-Black people to adopt Black personhood through their avatars and across networked platforms like Facebook and Twitter. Much has been said by scholars, cultural critics, and everyday observers about the use of African American Vernacular English (AAVE) and the “blaccent” by non-Black people and companies seeking to harness the selling power of Black culture through tweets, memes, and other forms of quick content, with no investment in actual Black communities or people. Tools like Uberduck might therefore meaningfully be understood as extending these kinds of appropriative digital practices into the realm of sonic performance.
In many ways, my specific concerns about Uberduck are connected to broader developments that I have observed in regard to rap music, AI, and the veneer of techno-optimism that increasingly brings these worlds together. I am a Black feminist rapper with a PhD in science and technology studies (STS), a field that examines the social relations that coproduce scientific and technological knowledge and practices. As such, I have long been interested in exploring our dominant narratives about the technologies we make and use. So I couldn’t help but raise an eyebrow when a succession of stories at the intersection of rap performance and AI flitted across my radar last spring: first I was introduced to FN Meka, an “AI robot rapper” who, perhaps unsurprisingly, also sells NFTs. Around the same time, Google Arts & Culture announced the Hip Hop Poetry project, led by creative technologist Alex Fefegha, to answer the question of whether AI can rap. A few weeks later, I learned about the success of Uberduck imitating Kanye West. I listened only once before putting down my phone in discomfort.
I’ve since begun to think more deeply about the messaging around AI that emanates from stories like these—about whether, the creepiness and potential legal thorniness aside, we should uncritically accept the use of AI as a mode for crafting rap lyrics and performances. I worry that in our excitement to explore these new creative potentials we risk reproducing the same exploitative dynamics that currently separate Black and brown artists from the fruits of their labor, across music and countless other forms of entertainment.
For those of us who are invested in the future of hip-hop music and culture, it is important to start asking questions about what role, if any, we want AI to play in the rap space. We should be drawing on emergent discussions about digital blackface and Blacksound as a springboard for thinking more critically about who benefits from performances like deepfake Kanye’s “Bohemian Rhapsody.” As a practical matter, we should also be strategizing around the reality that future rappers may have to regularly contend with the use of AI to invoke Black voices and forms of expression in the absence of actual Black performers. And finally, we should be actively trying to develop rap synthesis tools that prioritize the pleasure of those most often exploited within the music industry. That is the version of hip-hop’s future that I want to experience.
When it comes to AI, like many working musicians, I’ve most immediately been concerned with the growing power of algorithmic curation. Such algorithms control the circulation of my music across apps like Spotify, and in the case of platforms like YouTube and TikTok, visual content as well.
I am intimately aware of just how competitive the rental market is for space on official Spotify playlists. For example, one of the primary roles that my record label played during my last release cycle was to try to get my singles placed on different Spotify playlists. I also know how much impact that placement can have: as I’m writing this, my song “Crown” sits at 287,000 listens on Spotify—after having landed on a Spotify workout playlist a number of years ago—while most of my other songs from that same project hover around 40,000 or fewer listens. Thus, the difference between being on a playlist and not is the difference between earning around $1,090 and earning $152 (according to the $.0038-per-stream model used by the Union of Musicians and Allied Workers in their ongoing campaign to force Spotify to change its payment structure).
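The arithmetic behind those two figures can be checked with a minimal Python sketch. It assumes only the $0.0038-per-stream rate and the listen counts quoted above; the function and variable names are hypothetical, and real payouts depend on factors this flat-rate model ignores.

```python
from decimal import Decimal

# Per-stream rate in USD, taken from the figure quoted in the text.
PER_STREAM_RATE = Decimal("0.0038")


def estimated_payout(streams: int) -> Decimal:
    """Estimate earnings in USD under a flat per-stream rate (a simplification)."""
    return (PER_STREAM_RATE * streams).quantize(Decimal("0.01"))


playlisted = estimated_payout(287_000)  # the playlisted song
typical = estimated_payout(40_000)      # a typical song from the same project

print(f"Playlisted song: ${playlisted}")
print(f"Typical song:    ${typical}")
print(f"Gap:             ${playlisted - typical}")
```

Under this model the gap between the two songs comes to a little over $900, which is the entire difference a single playlist placement makes.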
The bottom line? Ask any working musician, and AI has probably been on their mind over the past few years, far more than they might even recognize. But what I had not considered until recently, and what is now my primary focus, is the way AI has largely been positioned as an unproblematic tool for rap composition.
As with my understanding of AI in the context of music distribution, I first arrived at these conversations around AI and rap composition as a matter of necessity: in April 2019 I joined the team at Glow Up Games to help design RhymeStep for our first project, Insecure: The Come Up Game. We conceived of RhymeStep as a responsive rap mechanic, similar in spirit to Uberduck, that combines information drawn from a large pool of rap lyrics with certain user-generated inputs in order to produce a series of synthesized rhymes (although in our case, the rhymes are displayed as written text rather than sounded performances).
Never having worked on any kind of procedural design project, I immediately began googling and testing as many simple text-based rap generators as I could find (like RapPad and Song Lyrics Generator). Most seem to follow the same basic conceit: the user is asked to enter a “topic” of discussion or answer a series of prompts (like naming “a place” or “something somebody might complain about”), the responses to which are used to generate a series of bars (individual lines in a rap verse), crowdsourced from fragments of bars written by an in-house community of writers or lyrics from rap luminaries like Kanye West, Nicki Minaj, and Eminem.
As an exercise in “writing” rhymes, the overall experience of using a rap generator felt fairly abstract. The rhymes churned out by each system weren’t designed to artfully engage figurative language and wordplay beyond what was originally authored in the source material. Moreover, the content of the verses always ranged from loosely coherent to completely bizarre. Consider the “verse” I generated on the topic of “Anything”:
Still, as a research project about procedural rap synthesis, I found these initial explorations to be incredibly instructive, particularly as examples of the ways that different values and norms are encoded as deeply as the backend sorting logic itself. Within Song Lyrics Generator, for example, I soon noticed that no matter what topic I selected as a prompt, at least one or two lines featured phrases typically used to demean queer folks or women (like “chase these skirts”) or referenced sex acts in terms of control and domination (like “to take advantage with sex”). I began to wonder whether the frequent inclusion of these kinds of lyrics reflected an overrepresentation of the ideas in the source material or a decision on the part of the programmers to actively pull such words and phrases, no matter the songwriting prompt. Whatever the case, the fact that these “randomly” generated verses so frequently contained the same kinds of misogynist and queer-antagonistic fantasies found in many mainstream rap songs provided immediate cues about who the imagined audience for this tool is and is not.
And in a blog post reflecting on the process of developing generative-DOOM—a rap generator based on the oeuvre of rapper MF DOOM—technologist and activist Nabil Hassein makes explicit the imperialist, anti-Black elements embedded in the Pronouncing rhyme library that he used to program the system for detecting a rhyme:
As is documented on Pronouncing’s own website, it is based on a pronouncing dictionary which (like many other projects in the history of computing) was funded by the US military through DARPA. Only General American English is represented in the list of pronunciations for a given word. … Like many MCs, [MF DOOM] typically employs African American Vernacular English (AAVE), although again like many MCs, he will sometimes pronounce the same words differently for reasons of character and narrative, or to obtain more rhymes. Therefore, since my program depends on Pronouncing to generate pairs of rhyming words, many rhymes that DOOM might make, or has in fact made, will not be captured by generative-DOOM as it exists now. … It is remarkable that computing generally and natural language processing (NLP) specifically have been shaped so pervasively by imperialism that even a hip-hop rhyme generator art project is indelibly marked by it. This is without even touching on how much NLP research is English-only.
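Hassein’s point about single-accent dictionaries can be illustrated with a small Python sketch. It does not use the Pronouncing library or reproduce generative-DOOM. Instead it assumes a tiny hand-written lexicon in an ARPAbet-like notation (every entry is illustrative, not copied from any real dictionary) and a common convention in which two words rhyme when their sounds match from the last stressed vowel onward.

```python
# Illustrative, hand-written entries (an assumption of this sketch).
# Each word maps to a list of pronunciations; digits mark vowel stress.
SINGLE_ACCENT = {
    "thing": [["TH", "IH1", "NG"]],
    "ring": [["R", "IH1", "NG"]],
    "sang": [["S", "AE1", "NG"]],
}

# The same lexicon plus one alternate pronunciation an MC might choose.
WITH_VARIANTS = {
    **SINGLE_ACCENT,
    "thing": [["TH", "IH1", "NG"], ["TH", "AE1", "NG"]],
}


def rhyming_part(phones):
    """Return the phones from the last stressed vowel to the end of the word."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":  # 1 = primary stress, 2 = secondary stress
            return tuple(phones[i:])
    return tuple(phones)


def can_rhyme(a, b, lexicon):
    """True if any listed pronunciation of `a` rhymes with any of `b`."""
    parts_a = {rhyming_part(p) for p in lexicon.get(a, [])}
    parts_b = {rhyming_part(p) for p in lexicon.get(b, [])}
    return bool(parts_a & parts_b)


print(can_rhyme("thing", "ring", SINGLE_ACCENT))  # rhyme the dictionary knows
print(can_rhyme("thing", "sang", SINGLE_ACCENT))  # rhyme the dictionary misses
print(can_rhyme("thing", "sang", WITH_VARIANTS))  # captured once a variant is listed
```

The rhyme logic is identical in all three calls. What changes is whose pronunciations the data records, which is where the bias Hassein describes enters.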
Similarly, as I tested various rap generators, I thought about the way that hip-hop’s flexibility plays such a critical role in its cultural legibility, especially as it relates to the application of literary techniques like alliteration or slant rhymes: rhymes formed from words with similar but not identical rhyming sounds. In rap verse, slant rhymes can be particularly engaging because they invite a reimagination of how heard language operates on multiple levels while calling attention to the ingenuity of the MC.
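One way to see why slant rhymes strain rigid systems is to sketch how a program might have to define them. The Python example below assumes a deliberately crude rule: a rhyme is "perfect" when everything from the stressed vowel onward matches, and "slant" when only the stressed vowel matches. The phoneme entries are hand-written for illustration, and real slant rhymes include many patterns (shared consonants, bent vowels, multi-word rhymes) that this rule would miss.

```python
# Hand-written, illustrative phoneme entries (an assumption of this sketch).
PHONES = {
    "late": ["L", "EY1", "T"],
    "gate": ["G", "EY1", "T"],
    "shape": ["SH", "EY1", "P"],
    "shop": ["SH", "AA1", "P"],
}


def tail(word):
    """Phones from the last stressed vowel to the end of the word."""
    phones = PHONES[word]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":
            return phones[i:]
    return phones


def classify(a, b):
    """Label a word pair as a 'perfect' rhyme, a 'slant' rhyme, or 'none'."""
    tail_a, tail_b = tail(a), tail(b)
    if tail_a == tail_b:
        return "perfect"
    if tail_a[0] == tail_b[0]:  # same stressed vowel, different ending
        return "slant"
    return "none"


for pair in [("late", "gate"), ("late", "shape"), ("shape", "shop")]:
    print(pair, classify(*pair))
```

Any fixed rule like this one draws a hard line where an MC would hear a spectrum, which is part of what makes the question that follows difficult.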
To what extent, I wonder, do these “slant” kinds of creative possibilities fundamentally resist the logics of AI scripting, given rap’s cultural function as an art form that trades in what I have called “playful and militant forms of resistance and transgression”? And if rap is such an art form, then how might we create systems that are at least more flexible, and therefore reflective of the brilliance many rappers display through their transformative use of rhyme and rhythm?
Such considerations are even more complicated to sort out in the neighboring world of AI-powered vocal synthesis, exemplified by the deepfake Kanye performance of “Bohemian Rhapsody.” Naturally the increased circulation of audio deepfakes has produced all kinds of debates about authorship and ownership in regard to the timbre, cadence, and overall “style” of a particular voice. Content creators and their fans who circulate these kinds of songs on social media regard AI vocal imitations as a harmless tool of musical “fan fiction,” while more established artists like JAY-Z have (unsuccessfully) tried to make the case that circulation of these performances constitutes unlawful use of an artist’s “vocal style.”
For me, this question has been one of practical import—I’m still actively in product development mode in regard to the RhymeStep rap mechanic—but I’m also invested in this discussion as a Black artist who recognizes what is at stake for other creatives who continue to power hip-hop music and culture from the bottom.
Indeed, it’s no coincidence that rap music maintains such an outsized presence in these recent conversations about audio deepfakes. There is certainly an argument to be made that the formal elements of rap lyricism render it unique for architecting vocal synthesis tools, because rap songs typically provide more word and speech data per verse than do songs in other popular musical genres. As rap scholar Adam Bradley notes in the introduction to his Book of Rhymes: The Poetics of Hip-Hop, “There just aren’t enough words in a given pop lyric line to fill up a rap bar in the way we have come to expect.”
But to stop there is to miss an important opportunity to challenge a paradigm before it becomes an industry standard. As Morrison’s concept of Blacksound suggests, the same racist market dynamics that have made Kanye’s and JAY-Z’s deepfakes go viral cannot be understood outside of a longer historical arc that leads back to blackface minstrelsy as the starting point for understanding the shape of our music industry and our national identity. Perhaps sitting with this history means that before we use tools like Uberduck AI to share a neat or funny rhyme in the “voice” of a well-known rapper, we think about the stakes of amplifying the kind of ventriloquism encouraged on these platforms for future Black artists. In the context of text-driven rap generators like Song Lyrics Generator or Nabil Hassein’s MF DOOM generator, it also means that we should reflect on the biases that shape the kinds of words and bars that are served to us as neutral.
And from the developer side, it means asking how popular discourses around AI and rap composition might sound different if we were in conversation with the communities whose work fuels our innovations, rather than engaging with them after the fact, if at all. And it means asking how we can be more proactive in developing rap compositional tools that prioritize safety and, vitally, pleasure for those who are the most at risk of being erased within hip-hop culture and on the platforms built by big tech.[1]
I’ve gleaned some insights in the years since I started working on the RhymeStep rap mechanic at Glow Up Games. These ideas are rooted in what my Glow Up colleague and friend Latoya Peterson and I are calling BFGD: a framework of Black Feminist Game Design.
For us, applying a BFGD lens has primarily meant approaching the rhyme mechanic with the desire to center the needs of those most often pushed to the margins. This desire reflects the Black feminist political movements of the 1970s and 1980s, which identified the doubly marginal status frequently afforded Black women as recipients of both racism and sexism. And it has also meant taking up the work of hip-hop feminists like Dr. Joan Morgan who push for greater engagement with “the politics of pleasure” in Black feminist praxis.
We recognized at Glow Up that if we relied solely on the tools that already exist around rap synthesis, the experience we provided in the game would not be pleasurable for our core audience of Black millennial women and hip-hop heads across all age groups. So we decided to devise our own pronunciation library, crowdsourced from the vocabularies of our team at the studio and our broader digital communities. The database now includes thousands of common words and colloquialisms, subdivided into ten major rhyme families and further organized by a bespoke system of tags that denote categories of words that are relevant to our community (like words for a person’s aura or “swag”).
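A database organized this way could be modeled roughly as in the Python sketch below. This is not the Glow Up schema: the class, the family labels, the tag names, and the sample words are all hypothetical, chosen only to show how rhyme families and community-specific tags might combine in a lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    rhyme_family: str              # hypothetical family label
    tags: frozenset = frozenset()  # hypothetical community-defined tags


# Hypothetical sample entries, for illustration only.
LEXICON = [
    Entry("glow", "OH", frozenset({"aura"})),
    Entry("flow", "OH", frozenset({"craft"})),
    Entry("shine", "INE", frozenset({"aura"})),
    Entry("mine", "INE", frozenset({"ownership"})),
]


def lookup(family, tag=None):
    """Return words in a rhyme family, optionally narrowed to a single tag."""
    return [
        entry.word
        for entry in LEXICON
        if entry.rhyme_family == family and (tag is None or tag in entry.tags)
    ]


print(lookup("OH"))          # every word in one rhyme family
print(lookup("OH", "aura"))  # only the words tagged as describing an aura
```

The design choice worth noticing is that the tags are authored by the community rather than inferred from a corpus, so the categories reflect what the people using the tool consider meaningful.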
Creating this database has required difficult conversations about which features to discard because of the harm they might cause to marginalized communities, regardless of their perceived ludic benefits. For all the subversive power contained within the n-word—as it is used by Black people for naming friends and foes alike—we felt that it would be too easily weaponized by non-Black people if we included it in the game in any way. We also excised the ableist slur “lame,” a word used colloquially to mean “uncool” that appears frequently in hip-hop songs.
On the flip side, we decided to keep the words “bitch” and “ho” in the game after extended conversations about the stakes of this inclusion. To make this decision, we drew on the work of trap feminist Sesali Bowen, who has thoughtfully unpacked how the appropriation of these terms by women and queer rappers has been critical for expressing irreverence toward traditional heteronormative performances of gender, sexuality, and power. But to protect players against the potential weaponization of these words, we added a programmatic caveat: players can use the words only in rhymes where they are speaking affirmatively about themselves, as in the phrase “I’m that bitch!”
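A caveat of this kind could be approximated with a simple rule, sketched below in Python. This is not how RhymeStep works internally, which the text does not describe. It assumes a crude heuristic: a restricted word passes only when the line opens with a first-person statement and contains no pronoun pointing at someone else. The word lists and function names are hypothetical.

```python
import re

# The two words the text says were kept under a restriction.
RESTRICTED = {"bitch", "ho"}
# Hypothetical markers of a first-person, self-directed line.
SELF_OPENERS = ("i'm ", "i am ", "im ")
# Hypothetical markers that a line is aimed at another person.
OTHER_DIRECTED = {"you", "your", "she", "her", "he", "his", "they", "them", "their"}


def tokens(line):
    """Lowercase the line, normalize curly apostrophes, and split into words."""
    return re.findall(r"[a-z']+", line.lower().replace("’", "'"))


def is_allowed(line):
    """Allow restricted words only in lines where speakers describe themselves."""
    words = tokens(line)
    if not RESTRICTED & set(words):
        return True  # no restricted word, nothing to check
    normalized = " ".join(words) + " "
    speaks_about_self = normalized.startswith(SELF_OPENERS)
    targets_someone = bool(OTHER_DIRECTED & set(words))
    return speaks_about_self and not targets_someone


print(is_allowed("I’m that bitch!"))
print(is_allowed("You a ho"))
```

A heuristic this blunt will reject some lines that are in fact self-affirming, which is consistent with the trade-off described next: the constraint costs some flexibility in exchange for safety.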
Admittedly, our approach to creating a rap composition system is, in some ways, less flexible than many of the tools described here. Nevertheless, I believe that leading the design process with our non-negotiable safety guidelines has been necessary for the joy of our primary audience.
Ultimately, we reject the Silicon Valley ethos that we need to “move fast and break things” to serve our community. Similarly, we reject the notion that the solution for the problems created by unchecked techno-optimism is to grant technology even greater agency: create more nodes on the network, push for more “connectivity,” and loosen constraints around content creation and sharing.
Instead, an approach grounded in Black Feminist Game Design sees thoughtful constraints around language and performance as a source of productive tension. And this, in turn, can lead toward a more just and pleasurable world for folks on the margins, simply trying to devise sick new flows and fire ass rhymes.
This article was commissioned by Mona Sloane.
[1] I focus exclusively here on the uses of AI in rap vocalization, but there is much to be said about the integration of AI into music production tools like Boomy, which allow users to create music in the style of particular “genres” that are organized around a range of harmonic and instrumental probabilities. I would love to see further exploration of the idea that the algorithmic logics that power sites like TikTok are contributing to a substantive shift in the actual sound and songwriting structure of rap songs, as discussed in Emma Madden, “TikTok has broken rap music,” Wired, October 31, 2019.