As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. Web crawl data can be used to spot trends and identify patterns in economics, health, politics, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.
Common Crawl produces and maintains a repository of web crawl data that is openly accessible to everyone. The crawl currently covers 5 billion pages and includes valuable metadata. Small startups or even individuals can now access high-quality crawl data that was previously available only to large search engine corporations.
In this session, Common Crawl Director Lisa Green will discuss the value of open crawl data, explain how the Common Crawl corpus can be accessed, and give examples of how it is currently being used in research, education and business.
Lisa is motivated by a strong belief in the power of open systems to drive innovation in education, arts and research. Over the last several years she has been active in the areas of Open Access publishing, Open Science, Open Data, copyright, digital rights and policy.
Prior to joining Common Crawl, Lisa was Chief of Staff at Creative Commons, another non-profit organization that enables the sharing and use of creativity and knowledge through free legal tools.
Lisa holds a PhD in physical chemistry from the University of California, Berkeley.