The VanguaRd
R's Extensible Community
“There was no real intention to build anything other than a toy to play around with ideas.”
Robert Gentleman
“Anything you like,” the cab driver replied. Ross Ihaka settled into the back seat en route to his new PhD program at the University of California, Berkeley, and considered the answer to his question about what went on in the city. The freedom of the “People’s Republic of Berkeley” during the late 1970s and ’80s greatly influenced Ihaka. After several years at MIT, Ihaka found himself back in his home country of New Zealand for a professorship at the University of Auckland. By 1992, another statistician, Robert Gentleman of the University of Waterloo in Ontario, Canada, had taken the 20-hour flight for a few months of research at Auckland. Their timing would prove serendipitous.
Gentleman and Ihaka frequently crossed paths, sharing a penchant for “playing academic fun and games with statistical programming languages.” Finally, Gentleman stopped Ihaka in the corridor with an invitation to write software together. The two huddled over the same computer, “one person typing, the other person looking over their shoulder at what they were doing, and criticizing, making suggestions.” According to Ihaka, this collaboration led to a “kind of mind meld where we could pretty much complete each other's sentences.” R and R, as they came to be known, had grown frustrated with the lack of data analysis programming tools for their Macintosh computer lab. The Scheme language had become unwieldy to write due to its complex syntax. Meanwhile, the language “S” had the syntax they wanted but lacked a Scheme-like interpreter. They set out to develop a new language using the S syntax but with improved memory management and the ability to create variables locally within functions rather than globally.
Lingua Franca
The two stood on giants’ shoulders. S had been developed at Bell Labs, where its creator, John Chambers, built upon John Tukey’s work on new systems for doing science with data. S’s founding principles sought to make it a “lingua franca” of scientific analysis, enabling greater collaboration and interdisciplinary efforts among academics. As Tukey said, “the future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology.” To that end, Chambers wrote a series of books on how to develop the language, which the professors in Auckland used to build the programming language. They called it R, as both a nod to S and a reference to their first names. The next year, after a fellow Canadian professor complained about needing a Macintosh version of S, they released R to the public. As described in an article by Nick Thieme for Significance Magazine, they posted a usable version on StatLib, an online system for distributing statistical software and data, and let R out into the wild. As a sign of gratitude for Chambers’ contributions, R & R sent Chambers a compact disc of R version 1.0.
VanguaRd
More users quickly got involved in building on the free software. Echoing the philosophies of Berkeley in its heyday, the team adopted something like a vanguardist approach to growth: a small, senior group takes on the custodial work required to advance the language. Bug fixes and change requests had grown too burdensome for the fledgling free software project, so in 1997 they organized the “R Core” team, consisting of 11 users, to approve changes to the base language. “The users were the developers in those days,” Ihaka has said. They didn’t stop there, though. R Core went on to build a repository known today as CRAN (the Comprehensive R Archive Network). CRAN became essential for sharing information and packages, or code libraries. Tens of thousands of users began building more packages and new features than the R Core team could count. “The idea is you create software and they are free to modify and extend it in any way they like... [through a copyleft], meaning you can’t restrict what they do,” as Ihaka has noted.
R-Ladies and Gentleman
By the early 2000s, R “grew like wildfire,” according to Erin LeDell, a co-founder of the user group R-Ladies. That was because “the community was so welcoming [and] supporting [of] women,” she said recently in an interview. Tareef Kawaf, President of the RStudio developer Posit, recently repeated this sentiment, noting the community is “very smart, very humble, very nice, very welcoming.” Today, estimates show R-Ladies has more than 40,000 members in more than 44 countries, underscoring the R user community’s wide diversity. At the same time, Kawaf espouses the benefits of small teams of two to three people. At Posit, teams consist of “no more than six or seven loosely federated engineers.”
Tidy Team
As the R community has grown, its needs have evolved. Individual users originally flocked to R because of its interoperability and open library of packages. However, cleaning and preparing data made up an increasingly large part of data scientists’ jobs. Answering this call, Hadley Wickham of Posit rose to data science fame in the early aughts when he developed a set of packages called the tidyverse. One such package, dplyr, makes data munging simpler and has grown to over 1 million monthly downloads. Perhaps un-coincidentally, Wickham completed his master’s in statistics at none other than the University of Auckland in his home country of New Zealand, where Ihaka and Gentleman first met. You could say he is a descendant in Ihaka’s whakapapa, or genealogy in Ihaka’s native Māori language. Although not Māori himself, Wickham has remained true to those statistical roots, renaming the new object-oriented programming system R7 to “S7”, hearkening back to the original S language developed at Bell Labs.
While staying true to R’s roots, Posit continues to push the boundaries of R’s original principle of interoperability. Wickham has launched a project at Posit with Wes McKinney, the creator of the Python data science library pandas, to bring R and Python closer together using a framework called Apache Arrow. As Wickham recently said, “With R, because you can combine things from different packages, [like packages dplyr or ggplot2, that leads to] fairly big impacts on the user experience and almost even how the community has to work together and form.” After all, Posit must differentiate itself by offering services to a community that has always used R for free.
The journey of R from a simple tool for academic exploration to a cornerstone of data science is a testament to the power of collaboration and open-source development. Its continued popularity and flexibility prove that small but diverse and global teams can make scientific analysis with data easier, even fun. As R Core passes the torch to the next generation of developers, users, and commercial outfits like Posit, the spirit of R continues. Its community and reputation for practical problem-solving will continue to drive its evolution. Today, the hallmark of R remains its extensibility. What will become of commercial or non-profit efforts to expand the user community? How will Apache Arrow bring R and Python together? What other possibilities might the R user community unlock? The answer remains: anything you like.
The Virus that Gives Flight
In the garden fields of Japan, cucumber mosaic virus (CMV) joins with a Y-satellite RNA, or Y-Sat. The leaves first develop a mosaic pattern and then a yellow tinge. The mustard color attracts a swarm of aphids. They feed on the plants with only a marginal effect on photosynthesis. Biologists call it conditional mutualism. In small numbers, aphids enjoy the snack thanks to the virus and move on.
When the aphids swarm in large numbers, the leaves’ shade turns to the color of roasted red clay. Then a minuscule miracle occurs: the CMV carrying Y-Sat promotes wing formation. The virus gives the aphids the power to fly, enabling them to survive the wild beyond.