Zoe Oboler and Izzy Blair, "Comparing Artwork Metadata", link: https://zoeo11.github.io/

Artwork Metadata is a collaborative effort between Izzy Blair and Zoe Oboler. We are using Google Colab, Python, pandas, and Matplotlib for analysis and visualization. Our dataset is a comprehensive spreadsheet of the items currently in the collection of the Met Museum in New York City. It covers more than 400,000 items, so it is a very large amount of data to work through (this is also why we are not hosting it on GitHub, given GitHub's file size restrictions). Each row (representing one object) holds 43 columns of data, including the department of the museum, the nationality and time period of the artist, the type of object, the date the object was recorded/collected, and more. We have compiled visualizations for various groupings of objects, which will help us further in our predictive model.

Our model will use this data to predict the date of creation (i.e., "Object Begin Date") based on other criteria. This will be an especially helpful tool for predicting the begin date of older objects, as many older objects in the database lack information.

Collaboration Plan: We will work together at least once or twice a week (we have met up four times and collaborated virtually more often) to decide what metrics we want to display and how to visualize them. Our original plan was to compare metadata from the Met Museum against other collections; however, the data is so large that we have only had the capacity to work on the Met Museum data thus far. There are still plenty of ways to analyze a single dataset, as so much information is provided about each object. Some questions we hope to answer: How do objects from different cultures compare in terms of collection date, creation date, and object type? How do different object types differ in collection size and place of origin? Which time period contains the most objects and the most diverse set of artists?

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/content/drive/MyDrive/MetObjects.csv")
df.head()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-5-6c1a4f58638b> in <cell line: 1>()
----> 1 df = pd.read_csv("/content/drive/MyDrive/MetObjects.csv")
      2 df.head()

/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    910     kwds.update(kwds_defaults)
    911 
--> 912     return _read(filepath_or_buffer, kwds)
    913 
    914 

/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    575 
    576     # Create the parser.
--> 577     parser = TextFileReader(filepath_or_buffer, **kwds)
    578 
    579     if chunksize or iterator:

/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
   1405 
   1406         self.handles: IOHandles | None = None
-> 1407         self._engine = self._make_engine(f, self.engine)
   1408 
   1409     def close(self) -> None:

/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
   1659                 if "b" not in mode:
   1660                     mode += "b"
-> 1661             self.handles = get_handle(
   1662                 f,
   1663                 mode,

/usr/local/lib/python3.10/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    857         if ioargs.encoding and "b" not in ioargs.mode:
    858             # Encoding
--> 859             handle = open(
    860                 handle,
    861                 ioargs.mode,

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/MetObjects.csv'
df = pd.read_csv("/content/drive/MyDrive/ArtMetaData/MetObjects.csv")
df.head()
<ipython-input-4-a5a233a04e20>:1: DtypeWarning: Columns (7,8,9,10,11,18,27,28,29,30,31,32,33,34,35,36,37,39) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("/content/drive/MyDrive/ArtMetaData/MetObjects.csv")
Object Number Is Highlight Is Public Domain Object ID Department Object Name Title Culture Period Dynasty ... Subregion Locale Locus Excavation River Classification Rights and Reproduction Link Resource Metadata Date Repository
0 1979.486.1 False False 1 American Decorative Arts Coin One-dollar Liberty Head Coin NaN NaN NaN ... NaN NaN NaN NaN NaN Metal NaN http://www.metmuseum.org/art/collection/search/1 4/3/2017 8:00:08 AM Metropolitan Museum of Art, New York, NY
1 1980.264.5 False False 2 American Decorative Arts Coin Ten-dollar Liberty Head Coin NaN NaN NaN ... NaN NaN NaN NaN NaN Metal NaN http://www.metmuseum.org/art/collection/search/2 4/3/2017 8:00:08 AM Metropolitan Museum of Art, New York, NY
2 67.265.9 False False 3 American Decorative Arts Coin Two-and-a-Half Dollar Coin NaN NaN NaN ... NaN NaN NaN NaN NaN Metal NaN http://www.metmuseum.org/art/collection/search/3 4/3/2017 8:00:08 AM Metropolitan Museum of Art, New York, NY
3 67.265.10 False False 4 American Decorative Arts Coin Two-and-a-Half Dollar Coin NaN NaN NaN ... NaN NaN NaN NaN NaN Metal NaN http://www.metmuseum.org/art/collection/search/4 4/3/2017 8:00:08 AM Metropolitan Museum of Art, New York, NY
4 67.265.11 False False 5 American Decorative Arts Coin Two-and-a-Half Dollar Coin NaN NaN NaN ... NaN NaN NaN NaN NaN Metal NaN http://www.metmuseum.org/art/collection/search/5 4/3/2017 8:00:08 AM Metropolitan Museum of Art, New York, NY

5 rows × 43 columns

df.shape
(448203, 43)

As you can see below, there are many columns of data. A number of these columns contain mostly NaN values, which we will clean up in order to make the data usable.

print(df.columns.tolist())
['Object Number', 'Is Highlight', 'Is Public Domain', 'Object ID', 'Department', 'Object Name', 'Title', 'Culture', 'Period', 'Dynasty', 'Reign', 'Portfolio', 'Artist Role', 'Artist Prefix', 'Artist Display Name', 'Artist Display Bio', 'Artist Suffix', 'Artist Alpha Sort', 'Artist Nationality', 'Artist Begin Date', 'Artist End Date', 'Object Date', 'Object Begin Date', 'Object End Date', 'Medium', 'Dimensions', 'Credit Line', 'Geography Type', 'City', 'State', 'County', 'Country', 'Region', 'Subregion', 'Locale', 'Locus', 'Excavation', 'River', 'Classification', 'Rights and Reproduction', 'Link Resource', 'Metadata Date', 'Repository']
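To quantify that missingness before cleaning, a quick check (a minimal sketch):

# Count missing values per column, most-missing first
df.isna().sum().sort_values(ascending=False).head(10)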
df['Object Begin Date'].value_counts()
Object Begin Date
 1800    23579
 1700    16355
 1900    10583
 1600     7157
 1888     5936
         ...  
-2599        1
 1423        1
-1599        1
 646         1
 1017        1
Name: count, Length: 2074, dtype: int64
df['Classification'].value_counts()
Classification
Prints                                       69260
Prints|Ephemera                              30033
Photographs                                  26821
Drawings                                     25230
Books                                        14685
                                             ...  
Prints|Drawings|Books                            1
Books|Manuscripts|Ornament & Architecture        1
Ornament & Architecture|Books                    1
Albums|Books                                     1
Paper-Documents|Prints                           1
Name: count, Length: 1077, dtype: int64
df['Period'].value_counts()
Period
Edo period (1615–1868)                           8710
New Kingdom                                      6594
Middle Kingdom                                   4646
New Kingdom, Ramesside                           3758
Qing dynasty (1644–1911)                         3713
                                                 ... 
Late Archaic or Classical                           1
Imperial, Late Flavian–Hadrianic                    1
late Central or early Eastern Javanese period       1
Early Imperial, Augustan, probably                  1
Ramesside/Third Intermediate Period                 1
Name: count, Length: 1695, dtype: int64

The cell below transforms the date column, which sometimes arrives in an unusable format: objects that took multiple years to make have both years concatenated into a single number (for example, 17501760 for an object begun in 1750 and finished in 1760). To reduce each of these to one usable year, we split the number apart and average the two years ((1750 + 1760) / 2 = 1755), then update the date column to reflect this change.

df2 = df.copy()

# Rows where two years were concatenated into a single number (e.g., 17501760)
mask = df2['Object Begin Date'] > 9999

# Split each concatenated value into its two years and average them;
# assigning with .loc avoids the SettingWithCopyWarning
df2.loc[mask, 'Object Begin Date'] = (df2.loc[mask, 'Object Begin Date'] // 10000
                                      + df2.loc[mask, 'Object Begin Date'] % 10000) / 2

# Propagate the cleaned values back to the original DataFrame
df.update(df2.loc[mask])
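A quick sanity check (a sketch) that no concatenated values remain:

# Every begin date should now be a plain year
assert (df2['Object Begin Date'] <= 9999).all()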

Below, we filter our original dataframe to show the distribution of object classifications for which the Met holds more than 2,000 objects.

filtered_df = df['Classification'].value_counts()[df['Classification'].value_counts() > 2000]

# Plotting the bar chart for the filtered counts
filtered_df.plot(kind='bar')
plt.xlabel('Classification')
plt.ylabel('Count')
plt.title('Distribution of Classification (Counts > 2000)')
plt.show()

Distribution of Classification: There are far more classifications of objects than shown here, but we condensed the display to classifications with more than 2,000 objects. We did this by filtering the value counts to those greater than 2,000 and then used Matplotlib to draw a bar graph.



# Group by 'Object Begin Date' (year) and count the number of objects
df3 = df2.loc[(df2['Object Begin Date'] >= 0) & (df2['Object Begin Date'] <= 2024)]
obj_per_year = df3.groupby('Object Begin Date').size()

# Plot the line graph
obj_per_year.plot(kind='line')
plt.xlabel('Year')
plt.ylabel('Number of Objects')
plt.title('Number of Objects Added per Year')
plt.grid(True)
plt.show()

Number of Objects Added per Year: We made this line plot by creating a new dataframe and counting the objects per begin year. It is important to note that, due to scale, we filtered out years before zero (BCE), which is something we will need to address later in the project.

Number of Prints per Culture: This plot comes from grouping print_df (a dataframe of just the Prints) by Culture and drawing a bar graph; a reconstruction sketch of that cell follows.
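The cell that produced that plot is not shown above; a minimal sketch of what it likely looked like, assuming print_df is the subset of rows classified as Prints:

# Hypothetical reconstruction of print_df: just the Prints
print_df = df[df['Classification'] == 'Prints']

# Count prints per culture and plot the most common cultures
prints_per_culture = print_df['Culture'].value_counts().head(20)
prints_per_culture.plot(kind='bar')
plt.xlabel('Culture')
plt.ylabel('Number of Prints')
plt.title('Number of Prints per Culture')
plt.show()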

df['Culture'].value_counts()
Culture
American                       22167
French                         18224
Japan                          16374
China                          13844
Italian                         6580
                               ...  
German, Weimar                     1
American Eskimo (Alaska)           1
possibly German, Königsberg        1
Russian (Petrograd)                1
French, presumably Paris           1
Name: count, Length: 7101, dtype: int64
newfiltered_df = df['Culture'].value_counts()[df['Culture'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Culture')
plt.ylabel('Count')
plt.title('Distribution of Cultures (Counts > 500)')
plt.show()

Distribution of Cultures: Similar to the Distribution of Classification graph, we graphed the different cultures and how many objects in the collection are recorded from each culture (we only had space to show cultures with more than 500 objects at the museum). It is interesting to compare this metric, which has the most objects from America, to the prints-per-culture distribution, where American prints make up a much smaller share.

newfiltered_df = df['Artist Display Name'].value_counts()[df['Artist Display Name'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Artist Display Name')
plt.ylabel('Count')
plt.title('Distribution of Artists (Counts > 500)')
plt.show()

We show here the distribution of artists by display name (the name on display at the Met museum).

newfiltered_df = df['Period'].value_counts()[df['Period'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Period')
plt.ylabel('Count')
plt.title('Distribution of Period (Counts > 500)')
plt.show()

Here we display the distribution of Period. Period is a broad generalization of time period, but this graph gives us a good idea of which periods most objects come from.

Model Plan: We use the location, culture, artist, and classification variables to predict the time/year a piece of art was made. Our EDA graphs show year of creation against varying factors such as region, culture, and artist. One model will predict time of creation from the above factors INCLUDING information about the artist. A second model will predict time of creation from the same factors but EXCLUDING information about the artist. We want this second model because many pieces were found without any record of the artist; it could help estimate when an object was created even when the artist is unknown.

Below, we split the data into different eras and use these eras to train era-specific models for Object Begin Date. We split by era because there is so much data that our computers do not have sufficient RAM to process it all at once, even after subsetting the data in other ways. We tried splitting the data randomly, but found our models were not very predictive in those cases.

# create a new df called dfme with only the rows from the last 100 years based on df2["Object Begin Date"]

dfme = df2[df2['Object Begin Date'] >= 1922]
dfme.shape
(73015, 43)

We also drop NaN values so that our model can use the data for prediction. We found that imputing values here skews the data, as most objects have begin dates close to 1900-2000. So, to avoid this skew, we simply dropped NaN values; as the shapes below show, this shrinks the modern-era subset considerably.

# drop dfme rows with NaN values in "Classification", "Culture"

dfme = dfme.dropna(subset=['Classification', 'Culture'])
dfme.shape
(2668, 43)
dfme1 = df2[(df2['Object Begin Date'] >= 1822) & (df2['Object Begin Date'] < 1922)]
dfme1 = dfme1.dropna(subset=['Classification', 'Culture'])
# Eras are contiguous: [1750, 1822) and [1822, 1922)
dfme2 = df2[(df2['Object Begin Date'] >= 1750) & (df2['Object Begin Date'] < 1822)]
dfme2 = dfme2.dropna(subset=['Classification', 'Culture'])
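The same subset-then-drop pattern repeats for each era; a small helper (a sketch, not part of the original run) makes the boundaries explicit:

# Hypothetical helper: build an era subset with the NaN policy above
def era_subset(frame, start, end):
    sub = frame[(frame['Object Begin Date'] >= start) & (frame['Object Begin Date'] < end)]
    return sub.dropna(subset=['Classification', 'Culture'])

# e.g., era_subset(df2, 1822, 1922) reproduces dfme1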

Below is our K-Nearest-Neighbors model to predict the Object Begin Date based on Classification and Culture. We expect that this will not be a very predictive model, as Culture is a broad generalization of an object (as you can see from our data above), so it will probably also produce generalized predictions. Also, we are only predicting for objects made after 1922. However, this is an important first step for our model. We will later test the accuracy of our model using k-fold cross validation.

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=10)

features = ["Classification", "Culture"]
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

X_train_dict = dfme[features].to_dict(orient="records")
y_train = dfme["Object Begin Date"]

# Fitting the pipeline runs the same vectorize -> scale -> fit steps
# that would otherwise be done by hand
pipeline.fit(X_train_dict, y_train)

y_train_pred = pipeline.predict(X_train_dict)
y_train_pred

# k-fold cross validation for an out-of-sample error estimate
scores = cross_val_score(pipeline, X_train_dict, y_train,
                         cv=17, scoring="neg_mean_squared_error")
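The scores are negative MSEs; to read the cross-validated error in years, a quick conversion (a sketch):

# Root-mean-squared error averaged over the folds, in years
rmse = np.sqrt(np.mean(-scores))
rmse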

Below, we plot the MSE (mean squared error) for our model. We also plot a line that shows how our MSE changes as k increases. This tests the accuracy of our model and also gives us a value of k that has the minimum error. That k value here seems to be around six or seven.

def get_cv_error(k):
    # Cross-validated MSE for a k-NN pipeline with n_neighbors=k
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict, y_train,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

# Sweep k from 1 to 50 and plot the error curve
ks = pd.Series(range(1, 51), index=range(1, 51))
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
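This sweep is memory-hungry: DictVectorizer(sparse=False) densifies a matrix with one column per distinct Classification and Culture value, which strains the limited RAM we mentioned above. One workaround (a sketch, not what we ran) keeps the matrix sparse; StandardScaler then needs with_mean=False, and KNeighborsRegressor accepts sparse input:

# Sketch: sparse variant of the same pipeline to reduce memory use
# (classes are the ones imported above)
sparse_pipeline = Pipeline([
    ("vectorizer", DictVectorizer(sparse=True)),
    ("scaler", StandardScaler(with_mean=False)),
    ("fit", KNeighborsRegressor(n_neighbors=10)),
])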

Below, we plot a color-mapped scatter of our first prediction model. As you can see, it is not very predictive, as we expected.

pipeline.fit(X_train_dict, y_train)

top_cultures = dfme['Culture'].value_counts().nlargest(10).index

filtered_df = dfme[(dfme['Culture'].isin(top_cultures)) & (dfme['Object Begin Date'] > 0)].copy()

# Predict for exactly the filtered rows so each point's color
# corresponds to its own prediction
predicted_values = pipeline.predict(filtered_df[features].to_dict(orient="records"))

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df['Culture'], filtered_df['Object Begin Date'], c=predicted_values, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()

Below we will do the same thing for a new era, using dataframe dfme1.

# For dfme1, the 1822-1921 era
features = ["Classification", "Culture"]

X_train_dict1 = dfme1[features].to_dict(orient="records")
y_train1 = dfme1["Object Begin Date"]

pipeline1 = Pipeline([("vectorizer", DictVectorizer(sparse=False)),
                      ("scaler", StandardScaler()),
                      ("fit", KNeighborsRegressor(n_neighbors=10))])

# Fit on this era's records (the vectorizer must be fit on
# X_train_dict1, not the first era's dictionaries)
pipeline1.fit(X_train_dict1, y_train1)

y_train_pred1 = pipeline1.predict(X_train_dict1)
y_train_pred1

scores1 = cross_val_score(pipeline1, X_train_dict1, y_train1,
                          cv=17, scoring="neg_mean_squared_error")

Again, we get the MSE for this prediction model to test the accuracy.

# dfme1
def get_cv_error1(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    # Score on the 1822-1921 era's data, not the first era's
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict1, y_train1,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

ks = pd.Series(range(1, 51), index=range(1, 51))
test_errs = ks.apply(get_cv_error1)

test_errs.plot.line()
test_errs.sort_values()
# dfme1
pipeline1.fit(X_train_dict1, y_train1)

top_cultures1 = dfme1['Culture'].value_counts().nlargest(10).index

filtered_df1 = dfme1[(dfme1['Culture'].isin(top_cultures1)) & (dfme1['Object Begin Date'] > 0)].copy()

# Predict for the filtered rows so colors line up with the points
predicted_values1 = pipeline1.predict(filtered_df1[features].to_dict(orient="records"))

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df1['Culture'], filtered_df1['Object Begin Date'], c=predicted_values1, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()

And again for the third era.

# For dfme2, the 1750-1821 era
features = ["Classification", "Culture"]

X_train_dict2 = dfme2[features].to_dict(orient="records")
y_train2 = dfme2["Object Begin Date"]

pipeline2 = Pipeline([("vectorizer", DictVectorizer(sparse=False)),
                      ("scaler", StandardScaler()),
                      ("fit", KNeighborsRegressor(n_neighbors=10))])

# Again, fit the vectorizer on this era's own records
pipeline2.fit(X_train_dict2, y_train2)

y_train_pred2 = pipeline2.predict(X_train_dict2)
y_train_pred2

scores2 = cross_val_score(pipeline2, X_train_dict2, y_train2,
                          cv=17, scoring="neg_mean_squared_error")
#dfme2
def get_cv_error2(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict2, y_train2,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

ks = pd.Series(range(1, 51), index=range(1, 51))
test_errs = ks.apply(get_cv_error2)

test_errs.plot.line()
test_errs.sort_values()
# dfme2
pipeline2.fit(X_train_dict2, y_train2)

top_cultures2 = dfme2['Culture'].value_counts().nlargest(10).index

filtered_df2 = dfme2[(dfme2['Culture'].isin(top_cultures2)) & (dfme2['Object Begin Date'] > 0)].copy()

# Predict for the filtered rows so colors line up with the points
predicted_values2 = pipeline2.predict(filtered_df2[features].to_dict(orient="records"))

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df2['Culture'], filtered_df2['Object Begin Date'], c=predicted_values2, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()

Below, we test the accuracy of a model with three criteria (Classification, Culture, and Medium) against a model with the same criteria plus Artist Display Name. As we discussed above, we compare these two models because many objects do not have an Artist Display Name, or any information about the artist at all.

features = ["Classification", "Culture", "Medium"]
X_dict = df2[features].to_dict(orient="records")
np.mean(
    -cross_val_score(pipeline, X_dict, y, cv=10, scoring="neg_mean_squared_error")
)
features = ["Classification", "Culture", "Medium", "Artist Display Name"]
X_dict = df2[features].to_dict(orient="records")
np.mean(
    -cross_val_score(pipeline, X_dict, y, cv=10, scoring="neg_mean_squared_error")
)

As you can see, the prediction with Artist Display Name included had the lower MSE, meaning it is the more accurate prediction. This is to be expected, as more information about an object allows for a more accurate prediction.

While this is a helpful data point, we think both models still have potential use cases for our project. Found objects that are being identified may not have a known artist, and neither do many of the older objects in the collection.

As part of our testing process, we chose an actual object that we already know information about. We found this object on the museum's website: https://www.metmuseum.org/art/collection/search/16885

We will use the model to predict the Object Begin Date for this object and see how accurate the prediction is.

features = ["Classification", "Culture", "Medium", "Artist Display Name", "Region"]

X_train = pd.get_dummies(dfme[features])
y_train = dfme["Object Begin Date"]
X_train.columns

x_new = pd.Series(index=X_train.columns, dtype=float)

x_new["Classification_Drawings"] = 1
x_new["Culture_American"] = 1
x_new["Medium_Graphite"] = 1
x_new["Region_American"] = 1


x_new.fillna(0, inplace=True)

x_new

X_train_mean = X_train.mean()
X_train_std = X_train.std()

X_train_sc = (X_train - X_train_mean) / X_train_std
x_new_sc = (x_new - X_train_mean) / X_train_std

# Find index of 30 nearest neighbors.
dists = np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:30].index

# Average the labels of these 30 nearest neighbors
y_train.loc[i_nearest].mean()
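The same prediction can also be made with sklearn's KNeighborsRegressor instead of hand-computed distances (a sketch over the same standardized features):

# Fit a 30-NN regressor on the standardized one-hot training features
knn = KNeighborsRegressor(n_neighbors=30)
knn.fit(X_train_sc, y_train)

# Predict for the single new observation; a 1-row DataFrame keeps column names
knn.predict(x_new_sc.to_frame().T)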

This model is therefore fairly accurate for this drawing. The museum website does not list a creation date for this object, only the year it was gifted to the museum (1925), so the prediction is quite close to the data we do have for this particular object.

Conclusions: Because our data is inundated with recently dated objects, we had to build a predictive model that predicts dates accurately without being skewed toward more recent years. So, instead of imputing missing values to train our model, we dropped all NaNs. We then used k-fold cross validation to test our model's accuracy. We found that models using more criteria (predicting from Culture, Classification, Artist Display Name, Region, and Medium rather than just Culture and Classification) were more accurate. This makes sense for two reasons: there is more data to train on, and some of these categories are very broad; one object classified as "American" might be vastly different from another American object. So the more criteria we have, the more accurately the model predicts the Object Begin Date. We were also selective about which criteria to use to train the model; for instance, we compared predictions with and without Artist Display Name because many objects have no information about the artist.

In conclusion, it is possible to predict the Object Begin Date of an object in the Met Museum's collection from other criteria. However, you must be selective about which criteria you use, and cautious about missing values, so as not to skew the predictions.